<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:792px;"></span>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:138px; top:120px; width:334px; height:16px;"><span style="font-family: VDCSKW+SimHei; font-size:16px">Text Mining in “Intelligent Government Affairs”: Principles, Implementation, and Application
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:268px; top:140px; width:74px; height:15px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">Woodbird Zhuo
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:74px; top:193px; width:468px; height:10px;"><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">Abstract: </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">This paper discusses the principles, algorithms, and implementation of text classification, hot-topic mining, and the evaluation of text relevance, completeness, and readability in text mining. It combines bigrams, the bag-of-words model, and the chi-square test with machine learning to classify texts automatically. To mine hot topics, it clusters the messages with DBSCAN and then identifies the hot topics from the messages' upvotes, downvotes, and time span. To score replies, it combines word vectors with a key-sentence extraction algorithm to measure the relevance of a reply, and uses bigrams with dictionary matching to measure its local completeness and readability. Combining the two gives an assessment of reply quality. To stay close to engineering practice, the paper lists some low-level optimized implementations at the end of each problem.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:118px; top:286px; width:328px; height:10px;"><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">Keywords: </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">text classification; machine learning; bigram; bag-of-words model; text clustering; word vectors
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:123px; top:321px; width:364px; height:22px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:22px">Text Mining in Intelligent Government Affairs
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:80px; top:339px; width:451px; height:22px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:22px">Management: Principle, Implementation and Application
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:82px; top:391px; width:447px; height:14px;"><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">Abstract </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">This paper covers the principles and implementation of text classification, hot-topic mining, and the scoring of texts by relevance, completeness, and readability. It uses 2-grams, the bag-of-words model, and the chi-square test to prepare the corpus, and then trains a text classifier with machine learning algorithms. For hot-topic mining, DBSCAN groups similar texts into clusters, and a scoring model based on the numbers of agrees and disagrees and the time span determines the hot topics. To measure the relevance between texts, the paper combines key-sentence extraction with word2vec to compute text similarity. To score completeness and readability, it uses 2-grams with dictionary matching, and it proposes a scoring model based on both. To stay in line with practice, the paper also presents some low-level implementation and optimization methods.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:82px; top:547px; width:431px; height:14px;"><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">Keywords: </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">text classification; machine learning; 2-gram; bag-of-words; text clustering; word2vec
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:152px; top:587px; width:66px; height:17px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:17px">1 Overview
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:616px; width:229px; height:60px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Extracting useful information from text is an important problem in natural language processing (NLP). Rule-based mining algorithms ran into setbacks as early as the 1950s. For text mining, practitioners therefore usually turn to statistical machine learning or to deep learning, the branch of machine learning built on neural network models. The former generally needs a hand-crafted feature template to preprocess the data. The latter does not: it behaves like a “black box” that extracts information automatically by training the nodes of a neural network. Text mining can also lighten the workload of staff who handle online government inquiries.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:762px; width:229px; height:10px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">When handling public messages, the first step is to sort them into categories. This can be done by training a machine learning model for automatic classification on the existing corpus of already-categorized message details. To that end, building on previous work [?], this paper uses bigrams and the chi-square test, which work very well, to convert unstructured text into structured feature vectors. Then, from the models commonly used in machine learning, it selects those suited to the problem, together with their parameters, by K-fold cross-validation and grid search, as shown in Figure 1:
<br></span></div><span style="position:absolute; border: gray 1px solid; left:0px; top:892px; width:612px; height:792px;"></span>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:137px; top:1100px; width:97px; height:14px;"><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">Figure 1. </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Solution approach for Problem 1
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:1140px; width:229px; height:56px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">In addition, the appendix gives the training process and results of a neural network. The trained models are then evaluated by F1 score, which shows that the models suited to text classification are the Bayes classifier, SVC, and logistic regression. The results also lend further support to the claim that neural networks do not improve text classification performance.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:74px; top:1220px; width:229px; height:10px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">To mine hot topics, the messages must first be clustered, so this is a text clustering task. The feature vectors produced by bigrams are too redundant, and because clustering is an unsupervised algorithm without labels, the chi-square test cannot be used to filter features. This paper therefore segments the corpus of message details with a conditional random field (CRF) tokenizer. After stop words are removed and tokens are merged on top of the coarse segmentation, principal component analysis (PCA) reduces the dimensionality of the data. The preprocessed data are then clustered with DBSCAN, an adaptive clustering algorithm, so that similar messages about the same issue fall into one cluster. Next, a popularity model for each issue is built from the messages' upvotes, downvotes, and time span. Finally, sorting by popularity in descending order yields the top 5 hot topics.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:1424px; width:229px; height:56px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Message details tend to be long, and a single hot topic may contain many messages. This paper therefore also combines the TextRank and BM25 algorithms to extract key sentences from the text, and the description of each hot topic is then summarized manually from those key sentences. The overall approach is shown in Figure 2.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:965px; width:229px; height:10px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">To evaluate the relevance of a reply, one needs the similarity between the reply and the message details, which then serves as the measure of relevance. This paper therefore uses word vectors, together with the key sentences of both texts, to compute the similarity between reply and message and thus assess relevance. To measure the completeness and readability of a reply, the paper uses bigram matching: while scanning the reply, it looks up each pair of adjacent characters in a dictionary, which measures the local completeness and readability of the reply to some extent. Finally, combining these factors gives a scoring model for rating replies.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:1105px; width:229px; height:10px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">At the end of each problem, the paper touches on the corresponding low-level implementations and optimizations, which reduce the complexity and running time of the algorithms. It also describes in detail how many of the finer points are handled.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:312px; top:1164px; width:230px; height:29px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Finally, the paper summarizes the work done and offers some of the author's modest views on the NLP field. It also discusses the paper's shortcomings and the open problems in NLP.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:380px; top:1220px; width:90px; height:17px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:17px">2 The Text Classification Problem
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:1247px; width:229px; height:10px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Classifying messages of unknown category on the basis of already-categorized public messages is clearly a text classification problem, and classification problems can be solved efficiently with machine learning. To classify automatically with a machine learning model, however, the unstructured text must first be converted into structured feature vectors. In the message corpus, the message details carry far more information than the message subject, and the posting time is unrelated to the category of the text</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">1</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">. Throughout this paper, therefore, </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">the corpus refers to the message details of all messages</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:1372px; width:229px; height:10px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">This section shows how to represent unstructured text as vectors with a bigram bag-of-words model. Because the number of features exceeds one hundred thousand, feeding them directly into model training would lead to the “curse of dimensionality”. The paper therefore uses the chi-square test to filter out features that have little influence on the classification result, which reduces the feature dimensionality. Finally, suitable models and model parameters are chosen from several common machine learning models according to their performance on the dataset.
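<br></span></div><div style="writing-mode:lr-tb;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">As a rough illustration of the bigram bag-of-words representation described above, the following Python sketch counts overlapping character bigrams. It assumes character-level bigrams with no word segmentation, which the text does not state explicitly. The function names char_bigrams, build_vocabulary, and vectorize and the two sample messages are made up for this example and are not taken from the paper's implementation.</span>
<pre>
```python
from collections import Counter


def char_bigrams(text):
    """Return the overlapping character bigrams of a string."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_vocabulary(corpus):
    """Map every bigram seen in the corpus to a column index.

    The bigrams are sorted so that the column order is deterministic.
    """
    grams = sorted({g for doc in corpus for g in char_bigrams(doc)})
    return {g: i for i, g in enumerate(grams)}


def vectorize(text, vocab):
    """Build a count vector; bigrams outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for gram, count in Counter(char_bigrams(text)).items():
        if gram in vocab:
            vec[vocab[gram]] = count
    return vec


# Two hypothetical messages, used only to exercise the functions.
corpus = ["道路积水严重", "道路施工噪音"]
vocab = build_vocabulary(corpus)
vectors = [vectorize(doc, vocab) for doc in corpus]
```
</pre>
<span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Each message becomes one row of counts over the shared bigram vocabulary. On a real corpus the vocabulary is far larger, which is where a feature filter such as the chi-square test comes in.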
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:137px; top:1568px; width:97px; height:14px;"><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">Figure 2. </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Solution approach for Problem 2
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:70px; top:1605px; width:8px; height:9px;"><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">1</span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">The author converted the posting times into an ordered sequence with minute precision and confirmed this point by one-way analysis of variance on a random sample (1000 samples in total).
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:390px; top:1503px; width:70px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">2.1 Feature Engineering
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:1529px; width:229px; height:10px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Processing the corpus into a form that machine learning can use is called feature engineering, or </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">data preprocessing</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">. For the text classification task here, the corpus is preprocessed with a bigram bag-of-words model combined with the chi-square test. The class label of each sample can
<br></span></div><div style="position:absolute; border: figure 1px solid; writing-mode:False; left:86px; top:962px; width:198px; height:142px;"></div><div style="position:absolute; border: figure 1px solid; writing-mode:False; left:86px; top:1512px; width:198px; height:60px;"></div><span style="position:absolute; border: black 1px solid; left:70px; top:1591px; width:92px; height:0px;"></span>
<span style="position:absolute; border: gray 1px solid; left:0px; top:1734px; width:612px; height:792px;"></span>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:102px; top:1803px; width:172px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">be converted directly into </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">ordered integers starting from 0</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">.
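<br></span></div><div style="writing-mode:lr-tb;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">A minimal sketch of such a label encoding in Python, assuming the categories are given as strings. The category names and the helper encode_labels are hypothetical, and sorting the names is only one possible way to fix the order of the integers.</span>
<pre>
```python
def encode_labels(labels):
    """Map each distinct label to an integer 0, 1, 2, ... in sorted order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes


# Hypothetical category names, for illustration only.
labels = ["urban", "transport", "urban", "education"]
codes, classes = encode_labels(labels)
```
</pre>
<span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Keeping the list of classes alongside the codes makes it possible to map a predicted integer back to its category name.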
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:145px; top:1829px; width:80px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">2.1.1 Bigrams
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:1855px; width:229px; height:10px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">As described above, this paper uses a bigram bag-of-words model to convert unstructured text into structured feature vectors. 所
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:314px; top:1807px; width:223px; height:11px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">\(B_c\), then the chi-square test checks whether \(P(A_n B_c) = P(A_n)P(B_c)\) holds.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:318px; top:1837px; width:215px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Let \(\chi^2\) denote the test statistic of the chi-square test. It is computed as follows:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:347px; top:1871px; width:43px; height:11px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">\[\chi^2(n, c) = \sum_{c} \sum_{n \in \{f_n, \bar{f}_n\}} \frac{(N_{nc} - E_{nc})^2}{E_{nc}} \tag{1}\]
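<br></span></div><div style="writing-mode:lr-tb;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">To make Equation (1) concrete, the Python sketch below evaluates it for a single feature and class from a 2x2 contingency table of document counts. It assumes that the expected count of each cell is the total times the product of the two marginal proportions, as implied by the independence hypothesis P(A_n B_c) = P(A_n)P(B_c), and that no marginal total is zero. The function name chi_square and its argument names are illustrative and do not come from the paper.</span>
<pre>
```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 feature/class contingency table.

    n11: documents in class c that contain the feature
    n10: documents outside class c that contain the feature
    n01: documents in class c that lack the feature
    n00: documents outside class c that lack the feature
    Assumes every row and column total is nonzero.
    """
    total = n11 + n10 + n01 + n00
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    feature_margin = {1: n11 + n10, 0: n01 + n00}
    class_margin = {1: n11 + n01, 0: n10 + n00}
    score = 0.0
    for (f, c), n_obs in observed.items():
        # Expected count if feature and class were independent.
        expected = total * (feature_margin[f] / total) * (class_margin[c] / total)
        score += (n_obs - expected) ** 2 / expected
    return score
```
</pre>
<span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">A score of 0 means the feature occurs independently of the class, so it carries no information for classification. Larger scores mark features that are more worth keeping.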
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:310px; top:1910px; width:230px; height:26px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">where \(N_{nc}\) is the number of times feature \(f_n\) occurs in documents of class \(c\), and \(E_{nc}\) is the expected number of joint occurrences of events \(A_n\) and \(B_c\), which is computed as follows:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:345px; top:1961px; width:98px; height:20px;"><span style="font-family: TWCQZW+CMMI10; font-size:9px">E_{nc} = N \times \frac{N_{nc} + N_{n\bar{c}}}{N} \times \frac{N_{nc} + N_{\bar{n}c}}{N}
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:528px; top:1966px; width:12px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(2)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:321px; top:1994px; width:209px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">where $\bar{n}$ and $\bar{c}$ denote logical negation and $N$ is the total frequency, i.e. $N = N_{nc} + N_{\bar{n}c} + N_{n\bar{c}} + N_{\bar{n}\bar{c}}$. Since $\gamma^2$ follows a chi-square distribution, the probability $p$ can be recovered from the computed value and the chi-square distribution.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:310px; top:2063px; width:235px; height:87px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The </span><span style="font-family: VDCSKW+SimHei; font-size:10px">null hypothesis</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> of the chi-square test is that $P(A_n B_c) = P(A_n)P(B_c)$ holds, i.e. the feature $f_n$ under test contributes little to the classification decision. The significance level is set to $\alpha = 0.001$, so the probability of a Type I error is 0.1%. Every feature in the corpus is then put through the chi-square test. If $p &lt; \alpha$, the null hypothesis is rejected and the feature is regarded as an important factor. Otherwise the null hypothesis is retained and the feature can be </span><span style="font-family: VDCSKW+SimHei; font-size:10px">deleted</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
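<br>The test above can be sketched in plain Python for a single (feature, class) pair. This is only an illustration under stated assumptions: the four inputs are treated as document counts of a 2 x 2 contingency table, the statistic is taken to have one degree of freedom, every marginal total is positive, and the function name chi2_2x2 is hypothetical.

```python
import math

def chi2_2x2(n_nc, n_n_notc, n_notn_c, n_notn_notc):
    """Chi-square statistic and p-value for one feature/class 2 x 2 table.

    Assumption: counts are document counts and the statistic has 1 degree
    of freedom; all row and column totals must be positive.
    """
    observed = [[n_nc, n_n_notc], [n_notn_c, n_notn_notc]]
    total = n_nc + n_n_notc + n_notn_c + n_notn_notc
    row = [sum(r) for r in observed]                        # feature present / absent
    col = [observed[0][j] + observed[1][j] for j in range(2)]  # class c / not c
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence: N * (row / N) * (col / N)
            expected = row[i] * col[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

stat, p_value = chi2_2x2(30, 10, 10, 30)
print(stat, p_value)
```

A feature would be kept when the returned p-value falls below the chosen significance level and dropped otherwise.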
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:320px; top:2172px; width:210px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Taking the data in Attachment 2 as an example, the sparse bag-of-words matrix has 396287 features. After chi-square filtering, 30291 features remain, a </span><span style="font-family: VDCSKW+SimHei; font-size:10px">reduction</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> to less than 10% of the original. This completes data preprocessing</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">3</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:363px; top:2259px; width:125px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">2.2 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Selecting a Text Classification Model
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:2290px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">With the data matrix in hand, a text classification model can be trained on the dataset by machine learning. Because the literature on Chinese language processing is relatively scarce, this paper selects the model best suited to text classification from among common machine learning models. In Appendix A, a BP neural network is used for text classification and is found to perform worse than ordinary machine learning models. This paper therefore concentrates on machine learning methods for the text classification problem.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:2402px; width:228px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Because models and parameters are selected by their $F_1$ score, for brevity </span><span style="font-family: VDCSKW+SimHei; font-size:10px">“goodness of fit” hereafter always refers to the $F_1$ score.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:76px; top:1886px; width:219px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">A bigram treats every two consecutive Chinese characters (after filtering out punctuation, tabs, line breaks and the like) as one feature. For example, the bigrams of the sentence </span><span style="font-family: AGCYGV+FangSong; font-size:10px">“第八届泰迪杯比赛。”</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> are </span><span style="font-family: AGCYGV+FangSong; font-size:10px">(“第八”, “八届”, “届泰”, ..., “杯比”, “比赛”)</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. Punctuation can be filtered out with </span><span style="font-family: VDCSKW+SimHei; font-size:10px">forward maximum matching</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">2</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
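<br>The bigram extraction can be illustrated with a short Python sketch. It is not the paper's implementation: punctuation is dropped here with a simple regular-expression character filter rather than forward maximum matching, the set of characters kept (CJK ideographs, ASCII letters and digits) is an assumption, and the function name char_bigrams is hypothetical.

```python
import re

def char_bigrams(text):
    """Return overlapping two-character features of a sentence.

    Assumption: only CJK ideographs, ASCII letters and digits are kept;
    punctuation and whitespace are discarded before pairing.
    """
    kept = re.findall(r"[\u4e00-\u9fffA-Za-z0-9]", text)
    # Pair each kept character with its successor
    return ["".join(pair) for pair in zip(kept, kept[1:])]

print(char_bigrams("第八届泰迪杯比赛。"))
```

For the example sentence this yields the seven pairs from “第八” to “比赛”, with the full stop ignored.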
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:1979px; width:229px; height:41px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">It is worth noting that </span><span style="font-family: VDCSKW+SimHei; font-size:10px">word segmentation</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> is another way to turn text into feature vectors. However, according to the open-source work of Guo Zhipeng (郭志芃) et al. </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">[</span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">?</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">]</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, using pairs of adjacent characters as features actually achieves better results.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:145px; top:2048px; width:80px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">2.1.2 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Bag-of-Words Model
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:74px; top:2070px; width:229px; height:29px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">In much of the English-language literature the bag-of-words model is abbreviated as </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">BOW</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. It collects all features of the corpus (after bigram extraction) into a single vector, which serves as the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">feature vector</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> of each message (document). Each element of a document's feature vector equals the number of times the corresponding feature occurs in that document. The corpus is thereby converted into a sparse matrix.
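<br>A minimal sketch of how per-document feature counts become a sparse matrix. It uses only the standard library and stores each row as a dictionary from column index to frequency; this storage format and the name bag_of_words are assumptions for illustration, not the paper's data structure.

```python
from collections import Counter

def bag_of_words(docs_features):
    """Build a vocabulary and sparse count rows from lists of features.

    Each row maps a column index to the frequency of that feature in the
    document; absent features are implicit zeros.
    """
    vocab = {}   # feature -> column index, in order of first appearance
    rows = []
    for features in docs_features:
        row = {}
        for feature, freq in Counter(features).items():
            col = vocab.setdefault(feature, len(vocab))
            row[col] = freq
        rows.append(row)
    return vocab, rows

vocab, rows = bag_of_words([["ab", "bc", "ab"], ["bc", "cd"]])
print(vocab)
print(rows)
```

Storing only the non-zero entries is what keeps a matrix with several hundred thousand columns manageable.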
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:145px; top:2173px; width:80px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">2.1.3 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Chi-Square Test
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:2200px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The bag-of-words model usually yields a sparse matrix, and feeding it directly to a machine learning model runs into the “curse of dimensionality”. For the sample data, bigram extraction followed by the bag-of-words model turns the corpus into a lopsided $9210 \times 396287$ matrix, i.e. each sample has nearly 400,000 features. The vast majority of its entries, however, are 0; in other words, the matrix is sparse.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:2293px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Moreover, many common words, such as stop words and filler expressions, have little influence on the classification decision, and many words occur frequently in samples of every class. To remove these factors, the chi-square test is used to filter out such uninformative features.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:2371px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Like one-way analysis of variance, the chi-square test is commonly used to judge whether two random events are independent. Let $f_n$, $n \in \{0, 1, \cdots, 396287\}$, denote a feature of the corpus; denote the event “the document contains $f_n$” by $A_n$, and the event “the document belongs to class $c$, $c \in \{0, 1, \cdots, 6\}$” by
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:82px; top:2432px; width:224px; height:23px;"><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">2</span><span style="font-family: ASLVIP+SimSun; font-size:9px">See Section </span><span style="font-family: JXSLWJ+LMRoman9-Bold; font-size:12px">??</span><span style="font-family: ASLVIP+SimSun; font-size:9px"> for details.
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">3</span><span style="font-family: ASLVIP+SimSun; font-size:9px">The preprocessed data is available in the attachment </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">data_q2_X_final_data.pkl
<br></span></div><span style="position:absolute; border: black 1px solid; left:450px; top:1878px; width:53px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:398px; top:1975px; width:46px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:458px; top:1975px; width:46px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:70px; top:2433px; width:92px; height:0px;"></span>
<span style="position:absolute; border: gray 1px solid; left:0px; top:2576px; width:612px; height:792px;"></span>
<div style="position:absolute; top:2576px;"><a name="4">Page 4</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:96px; top:2644px; width:178px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">2.2.1 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Model Selection by Cross-Validation and Grid Search
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:2649px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">manual choices. This paper therefore uses grid search, selecting the optimal parameters from a </span><span style="font-family: VDCSKW+SimHei; font-size:10px">parameter
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:2670px; width:229px; height:41px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">In the author's view, machine learning is both a theoretical science and a practical art. In NLP, and especially in the relatively young field of Chinese language processing, no model should be casually declared better or worse than another. Given the limited prior work in this area, the most suitable model and model parameters are selected from logistic regression</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">4</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, the support vector classifier (SVC below), decision trees, the k-nearest neighbors algorithm (kNN below), the naive Bayes classifier, random forests and AdaBoost.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:2794px; width:231px; height:56px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Before models can be compared, the best parameters of each model must be found. Among the models above, those with tunable parameters are SVC, decision trees, kNN, random forests and AdaBoost. As shown in Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, K-fold cross-validation is commonly used to assess how well a model performs on a given dataset. It makes $K$ copies of the dataset, denoted $D_i$, $i \in \{1, 2, \cdots, K\}$, and splits each $D_i$ into a training set and a test set in the proportion $\alpha\%$, where $\alpha = 100/K$, so that each copy holds out a different part of the samples as its test set. For a given model, $K$ sub-models are then trained on the $K$ training sets, and their goodness of fit on the corresponding test sets is computed, forming the goodness-of-fit sequence $S_i$, $i \in \{1, 2, \cdots, K\}$.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:131px; top:3116px; width:108px; height:14px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">3. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">Principle of K-fold cross-validation
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:74px; top:3147px; width:229px; height:44px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The mean $\bar{S}$ of the sequence $S_i$ measures the model's </span><span style="font-family: VDCSKW+SimHei; font-size:10px">overall</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> goodness of fit on the dataset. Among different models, the best one is the model with the largest $\bar{S}$.
<br>Different parameter settings of the same model can likewise be treated as different models and screened in the same way.
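<br>The K-fold procedure described above can be sketched as follows. This is a minimal, library-free illustration rather than the paper's implementation: the helper names (k_fold_indices, cross_val_scores), the contiguous fold split, the majority-class "model" and the accuracy score are all assumptions chosen to keep the example self-contained.
<pre>
```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if extra > i else 0)  # first `extra` folds get one more
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_scores(samples, labels, k, fit, score):
    """Return the sequence S_1..S_K: one held-out score per fold."""
    scores = []
    for held_out in k_fold_indices(len(samples), k):
        held = set(held_out)
        train = [i for i in range(len(samples)) if i not in held]
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        scores.append(score(model,
                            [samples[i] for i in held_out],
                            [labels[i] for i in held_out]))
    return scores


# Toy stand-ins (assumptions): a majority-class "model" scored by accuracy.
def fit_majority(xs, ys):
    return max(set(ys), key=ys.count)


def accuracy(model, xs, ys):
    return mean(1.0 if y == model else 0.0 for y in ys)


labels = ["a", "a", "b", "a", "a", "b", "a", "a", "a", "b"]
samples = list(range(len(labels)))
s = cross_val_scores(samples, labels, 5, fit_majority, accuracy)
s_bar = mean(s)  # the mean used to compare models
```
</pre>
Any classifier and any score (for instance F1) can be passed in through fit and score; only the mean s_bar is compared across models.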
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:3213px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">To find the best parameters of each model, one could enumerate every possible parameter value and screen the candidates by cross-validation. Exhaustive enumeration, however, is far too expensive. To reduce the computational load, one can enlarge the step size appropriately and place the candidate values in a </span><span style="font-family: VDCSKW+SimHei; font-size:10px">grid</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, from which the best parameters are selected.
<br>Because grid search looks for the best parameters only within this parameter grid, it can be regarded as an enumeration with </span><span style="font-family: VDCSKW+SimHei; font-size:10px">large steps, variable steps and human judgment mixed in</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
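<br>As a rough sketch of grid search, the snippet below evaluates every point of a small, unevenly spaced grid and keeps the one with the highest mean score. The function grid_search, the parameter names depth and alpha, and the scoring function toy_score are hypothetical stand-ins; in practice the score would be the mean K-fold score $\bar{S}$ of a real model.
<pre>
```python
from itertools import product


def grid_search(param_grid, mean_cv_score):
    """Evaluate every grid point; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        s_bar = mean_cv_score(params)
        if s_bar > best_score:
            best_params, best_score = params, s_bar
    return best_params, best_score


# Hand-picked, unevenly spaced candidates (the "human judgment" in the grid).
grid = {
    "depth": [7, 9, 11, 13, 15, 19, 24, 29, 39, 49],
    "alpha": [0.0005, 0.001, 0.01, 0.1],
}


def toy_score(params):
    """Stand-in for a mean K-fold score; peaks at depth 24 and small alpha."""
    return 1.0 - abs(params["depth"] - 24) / 100 - params["alpha"]


best, score = grid_search(grid, toy_score)
```
</pre>
Only 10 x 4 = 40 combinations are evaluated here, which is the cost saving over sweeping every depth and every alpha at a fine, uniform step.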
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:322px; top:2733px; width:206px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">2.2.2 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Selection results for models and their parameters, and the t-test
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:2759px; width:229px; height:58px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">As noted above, choosing the best model first requires selecting the parameters of each model. The models whose parameters need to be selected are kNN, SVC, the decision tree, the random forest and AdaBoost. This paper uses grid search combined with 5-fold cross-validation, taking the F1 score of each fold as $S_i$, to screen the models. The final results are shown in Table 1</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">5</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
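<br>For reference, the F1 score of one fold can be computed as sketched below. The text does not say how F1 is averaged over the classes, so macro averaging is an assumption here, and the category labels are hypothetical.
<pre>
```python
def f1_per_class(y_true, y_pred, cls):
    """F1 of one class: 2*TP / (2*TP + FP + FN), or 0.0 if the class never occurs."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (macro averaging, an assumption)."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical labels for one validation fold.
y_true = ["traffic", "traffic", "health", "health", "education"]
y_pred = ["traffic", "health", "health", "health", "education"]
score = macro_f1(y_true, y_pred)  # this value plays the role of one S_i
```
</pre>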
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:347px; top:2839px; width:157px; height:14px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Table </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">1. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">Parameter grids and selection results for each model</span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:316px; top:2858px; width:212px; height:331px;"><table>
<tr><th>Model</th><th>Parameter grid</th><th>Best result</th></tr>
<tr><td>kNN</td><td>$k$<span style="font-family: XFSCCX+LMRoman8-Regular; font-size:11px">1</span>: (3, 5, 7, 9, 11)</td><td>3</td></tr>
<tr><td>SVC</td><td>$C$<span style="font-family: XFSCCX+LMRoman8-Regular; font-size:11px">2</span>: (0, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 3, 4, 5, 6, 7, 8, 9)<br>Kernel: (linear, radial basis function, cubic polynomial)</td><td>$C = 0.1$, kernel: linear</td></tr>
<tr><td>Decision tree</td><td>Maximum depth $d$: (7, 9, 11, 13, 15, 17, 19, 24, 29, 34, 39, 44, 49, 54, 59, 64, 69, 74, 79, 84, 89)<br>$\alpha_{ccp}$<span style="font-family: XFSCCX+LMRoman8-Regular; font-size:11px">3</span>: (0.00025, 0.0005, 0.001, 0.00125, 0.015, 0.01, 0.05, 0.1)</td><td>$d = 79$, $\alpha_{ccp} = 0.0005$</td></tr>
<tr><td>Random forest</td><td>Number of base models: (15, 25, 35, 45, 50, 65, 75, 85, 95, 100, 150, 200, 250, 300)</td><td>75</td></tr>
<tr><td>AdaBoost</td><td>Number of base models: (15, 25, 35, 45, 50, 65, 75, 85, 95, 100, 150, 200, 250, 300)</td><td>15</td></tr>
</table></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:319px; top:3191px; width:192px; height:34px;"><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">1 </span><span style="font-family: ASLVIP+SimSun; font-size:9px">As a side remark, only odd values of $k$ are considered for kNN, since an odd $k$ helps avoid tied votes.
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">2 </span><span style="font-family: ASLVIP+SimSun; font-size:9px">The penalty parameter.
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">3 </span><span style="font-family: ASLVIP+SimSun; font-size:9px">The threshold for minimal cost-complexity pruning.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:310px; top:3240px; width:230px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">With the best parameters in hand, 5-fold cross-validation is applied once more to kNN with $k = 3$, the linear-kernel ($C = 0.1$)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:82px; top:3274px; width:353px; height:23px;"><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">4 </span><span style="font-family: ASLVIP+SimSun; font-size:9px">Regularization addresses overfitting. Since the goodness of fit of these models is low across the board, regularization is not used.
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">5 </span><span style="font-family: ASLVIP+SimSun; font-size:9px">Note that the parameter grids are unevenly spaced, denser at small values and sparser at large ones. This is the result of the human judgment mixed into them; for details see </span><span style="font-family: JXSLWJ+LMRoman9-Bold; font-size:12px">??
<br></span></div><div style="position:absolute; border: figure 1px solid; writing-mode:False; left:86px; top:2958px; width:198px; height:161px;"></div><span style="position:absolute; border: black 1px solid; left:310px; top:2855px; width:244px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:310px; top:2873px; width:244px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:310px; top:2891px; width:244px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:369px; top:2932px; width:109px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:310px; top:2973px; width:244px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:369px; top:3027px; width:109px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:310px; top:3082px; width:244px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:310px; top:3136px; width:244px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:310px; top:3191px; width:244px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:70px; top:3275px; width:92px; height:0px;"></span>
<span style="position:absolute; border: gray 1px solid; left:0px; top:3418px; width:612px; height:792px;"></span>
<div style="position:absolute; top:3418px;"><a name="5">Page 5</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:3487px; width:229px; height:60px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">SVC, the decision tree with $d = 79$ and $\alpha_{ccp} = 0.0005$, the random forest of 75 base decision trees with $d = 5$, AdaBoost with 15 logistic-regression base models, the naive Bayes classifier, and logistic regression, giving each model's goodness-of-fit sequence $S_i,\ i \in \{1, 2, \cdots, 5\}$ on the dataset. The resulting sequences are listed in Tables 2 and 3:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:83px; top:3589px; width:204px; height:137px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Table </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">2. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">Goodness-of-fit sequences of each model (with its best parameters)</span><table>
<tr><th>$S_i$</th><th>AdaBoost</th><th>Decision tree</th><th>kNN</th><th>Logistic regression</th></tr>
<tr><td>$S_1$</td><td>0.82</td><td>0.71</td><td>0.52</td><td>0.83</td></tr>
<tr><td>$S_2$</td><td>0.85</td><td>0.73</td><td>0.54</td><td>0.87</td></tr>
<tr><td>$S_3$</td><td>0.84</td><td>0.76</td><td>0.53</td><td>0.85</td></tr>
<tr><td>$S_4$</td><td>0.87</td><td>0.78</td><td>0.55</td><td>0.87</td></tr>
<tr><td>$S_5$</td><td>0.85</td><td>0.73</td><td>0.53</td><td>0.84</td></tr>
<tr><td>$\bar{S}$</td><td>0.85</td><td>0.74</td><td>0.53</td><td>0.85</td></tr>
</table><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:157px; top:3755px; width:57px; height:14px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Table </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">3. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">Continuation of Table 2</span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:96px; top:3769px; width:178px; height:123px;"><table>
<tr><th>$S_i$</th><th>Bayes classifier</th><th>Random forest</th><th>SVC</th></tr>
<tr><td>$S_1$</td><td>0.85</td><td>0.45</td><td>0.82</td></tr>
<tr><td>$S_2$</td><td>0.86</td><td>0.45</td><td>0.85</td></tr>
<tr><td>$S_3$</td><td>0.83</td><td>0.46</td><td>0.83</td></tr>
<tr><td>$S_4$</td><td>0.87</td><td>0.49</td><td>0.86</td></tr>
<tr><td>$S_5$</td><td>0.89</td><td>0.46</td><td>0.84</td></tr>
<tr><td>$\bar{S}$</td><td>0.86</td><td>0.46</td><td>0.84</td></tr>
</table><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:3916px; width:232px; height:59px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Judging by the mean $\bar{S}$ of each model's goodness-of-fit sequence, the decision tree, the random forest and kNN can be eliminated. The remaining models differ only slightly. Even so, one should not rashly conclude that these models are </span><span style="font-family: VDCSKW+SimHei; font-size:10px">equivalent</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> in performance. To decide whether they are, one additionally needs a </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">t
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:315px; top:3487px; width:221px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Since every pairwise t-test gives $p &gt; \alpha$, the null hypothesis cannot be rejected, and the models are therefore treated as pairwise equivalent in performance.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:363px; top:3529px; width:125px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">2.3 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Training the text classification model
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:314px; top:3551px; width:223px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Based on the t-test results, this section picks a </span><span style="font-family: VDCSKW+SimHei; font-size:10px">suitable</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> model and trains it.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:369px; top:3595px; width:113px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">2.3.1 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Model selection and analysis
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:310px; top:3618px; width:233px; height:138px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">According to the analysis in Section 2.2.2, AdaBoost, logistic regression, the Bayes classifier and SVC perform equally well. AdaBoost, being an ensemble model, clearly consumes more resources, so there is no need to choose it. As for training time, the Bayes classifier, which is trained with a simple non-iterative algorithm</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">6</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, clearly trains fastest, while logistic regression and SVC, which both need an iterative solver, are somewhat slower. Of these two, SVC has to solve the more complex optimization problem</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">7</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. On the other hand, compared with logistic regression, SVC only needs to train the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">support vectors</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. In other words, a hardware implementation can directly </span><span style="font-family: VDCSKW+SimHei; font-size:10px">discard</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> the samples that are not support vectors, so training the model consumes less memory.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:3793px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">In addition, because the Bayes classifier is trained with this simple algorithm, it has to store the frequency statistics of the data. It therefore uses more memory (3 MB), and its classification decisions also take longer. Together with the memory needed for data preprocessing, this makes the Bayes classifier impractical for settings such as embedded systems. Logistic regression and SVC are the opposite: they only need to store the model parameters (about 1 MB).
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:313px; top:3899px; width:225px; height:29px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Some readers may think that SVC is more stable (that is, its results fluctuate little from one training run to the next), perhaps because the support vector machine is also known as the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">maximum-margin model</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. However, the penalty parameter here is $C = 0.1$, which is close to 0, so this SVC is in fact very heavily </span><span style="font-family: VDCSKW+SimHei; font-size:10px">softened</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, and the claim that it is more stable does not hold.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:158px; top:3980px; width:59px; height:10px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">test</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:3981px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">In summary, where resources permit (for example on a personal computer), one can use
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:3993px; width:229px; height:91px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Like the chi-square test described in Section </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, the t-test is a statistical hypothesis test. It is used to judge whether the means of two sequences are equal at a given significance level. For reasons of space, its principle is not repeated here. This paper therefore runs pairwise t-tests on the goodness-of-fit sequences of AdaBoost, logistic regression, the Bayes classifier and SVC. With the significance level set to $\alpha = 0.05$, the test results are given in Table </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> (see Appendix A).
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:3996px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">贝叶斯分类器。如果要求简单至上，轻装上阵，则可
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:372px; top:4008px; width:112px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">以选择逻辑回归和 </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">SVC</span><span style="font-family: ASLVIP+SimSun; font-size:10px">。
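<br>The pairwise T-test on the goodness-of-fit sequences can be sketched roughly as follows. This is only an illustration under stated assumptions: the test is taken to be a paired test over scores from the same folds, the fold scores are made up, and the names paired_t_statistic and T_CRIT_DF9 are hypothetical. The paper's actual results are those in the table in Appendix A.

```python
import math
from itertools import combinations
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic of a paired T-test on two equally long score sequences."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical 10-fold F1 scores; these are NOT the paper's measurements.
scores = {
    "bayes":    [0.88, 0.87, 0.89, 0.88, 0.86, 0.88, 0.87, 0.89, 0.88, 0.87],
    "logistic": [0.87, 0.86, 0.88, 0.86, 0.86, 0.87, 0.85, 0.88, 0.87, 0.86],
    "adaboost": [0.70, 0.68, 0.72, 0.69, 0.71, 0.70, 0.67, 0.73, 0.69, 0.70],
}

# Two-sided critical value of Student's t for alpha = 0.05 and 9 degrees of freedom.
T_CRIT_DF9 = 2.262

for (name_a, a), (name_b, b) in combinations(scores.items(), 2):
    t = paired_t_statistic(a, b)
    verdict = "means differ" if abs(t) > T_CRIT_DF9 else "no significant difference"
    print(f"{name_a} vs {name_b}: t = {t:.2f} ({verdict})")
```

Comparing |t| with a tabulated critical value keeps the sketch free of external libraries; a statistics package would report the p-value directly.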
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:4024px; width:229px; height:75px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">It is also worth noting that the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">SVC </span><span style="font-family: ASLVIP+SimSun; font-size:10px">uses a linear kernel.
<br>In other words, this </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">SVC</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, like logistic regression, is a </span><span style="font-family: VDCSKW+SimHei; font-size:10px">linear
<br>classifier</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. Moreover, the non-linear classifiers, except for the Bayes
<br></span><span style="font-family: ASLVIP+SimSun; font-size:10px">classifier</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">8</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, all perform markedly worse. Why is this?
<br></span><span style="font-family: ASLVIP+SimSun; font-size:10px">The author believes that the very large number of features makes the data set linearly
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:82px; top:4105px; width:410px; height:35px;"><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">6</span><span style="font-family: ASLVIP+SimSun; font-size:9px">By storing the frequency information of the data
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">7</span><span style="font-family: ASLVIP+SimSun; font-size:9px">That is, the problem of minimizing the cost function during model training
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">8</span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">AdaBoost </span><span style="font-family: ASLVIP+SimSun; font-size:9px">is a </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">Boost </span><span style="font-family: ASLVIP+SimSun; font-size:9px">ensemble; a </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">Boost </span><span style="font-family: ASLVIP+SimSun; font-size:9px">ensemble of linear models is still linear, which the author has verified in earlier work
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:303px; top:4155px; width:5px; height:15px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">5
<br></span></div><span style="position:absolute; border: black 1px solid; left:77px; top:3605px; width:216px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:77px; top:3623px; width:216px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:77px; top:3641px; width:216px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:77px; top:3659px; width:216px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:77px; top:3677px; width:216px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:77px; top:3695px; width:216px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:77px; top:3712px; width:216px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:77px; top:3730px; width:216px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:90px; top:3771px; width:190px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:90px; top:3789px; width:190px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:90px; top:3807px; width:190px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:90px; top:3825px; width:190px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:90px; top:3843px; width:190px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:90px; top:3861px; width:190px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:90px; top:3879px; width:190px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:90px; top:3897px; width:190px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:70px; top:4106px; width:92px; height:0px;"></span>
<span style="position:absolute; border: gray 1px solid; left:0px; top:4260px; width:612px; height:792px;"></span>
<div style="position:absolute; top:4260px;"><a name="6">Page 6</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:163px; top:4333px; width:49px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">separable.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:145px; top:4355px; width:80px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">2.3.2 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Model Training
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:4333px; width:229px; height:26px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">paper stores only the indices and values of the nonzero elements when converting the bigrams into the bag-of-words model.
<br>This compresses the data to about </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">4MB </span><span style="font-family: ASLVIP+SimSun; font-size:10px">and also
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:336px; top:4360px; width:183px; height:13px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">saves the time the operating system would spend freeing and accessing memory</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">9</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
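<br>A minimal sketch of the idea of keeping only the nonzero indices and values. The 8-byte element size is an assumption (64-bit floats), chosen because it is consistent with the roughly 27GB quoted for the dense matrix. The helper to_sparse_rows and the toy vocabulary are hypothetical and are not the paper's code.

```python
from collections import Counter

N_DOCS, N_FEATURES = 9210, 396287   # matrix shape quoted in the text
BYTES_PER_FLOAT64 = 8               # assumption: dense entries are 64-bit floats

dense_bytes = N_DOCS * N_FEATURES * BYTES_PER_FLOAT64
print(f"dense size: {dense_bytes / 2**30:.1f} GiB")  # dense size: 27.2 GiB

def to_sparse_rows(docs, vocabulary):
    """Keep only (column index, count) pairs of the nonzero entries of each row."""
    rows = []
    for tokens in docs:
        counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
        rows.append(sorted(counts.items()))
    return rows

# Toy bigram documents and vocabulary (hypothetical, for illustration only).
vocabulary = {"ab": 0, "bc": 1, "cd": 2, "de": 3}
docs = [["ab", "bc", "ab"], ["cd"], []]
sparse = to_sparse_rows(docs, vocabulary)
print(sparse)  # [[(0, 2), (1, 1)], [(2, 1)], []]
```

The storage cost then grows with the number of nonzero entries rather than with the full 9210 × 396287 grid.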
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:4381px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">After a model has been chosen, the data set still has to be split into a training set and a
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:4380px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">In addition, because the feature values are frequencies, they are integers and mostly very
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:4396px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">test set. Some readers may think this step is redundant, because
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:316px; top:4395px; width:219px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">small, so they can be stored as one-byte unsigned integers (valid as long as no count
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:4412px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">the models were already trained repeatedly during model selection. It is not,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:370px; top:4411px; width:115px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">exceeds 255), which reduces the storage cost.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:4427px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">because the purpose of the test set is to measure the model's goodness of fit, and one always
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:4443px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">wants to measure it on unseen data. If information from the test set
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:76px; top:4458px; width:219px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">“leaks” into any stage other than testing, the test loses its
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:178px; top:4474px; width:19px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">meaning.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:358px; top:4434px; width:134px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">2.4.2 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Multi-class Classification and Class Imbalance
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:4456px; width:229px; height:29px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Logistic regression and </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">SVC </span><span style="font-family: ASLVIP+SimSun; font-size:10px">models can only output a positive or a
<br>negative result, so they cannot be applied directly to multi-class tasks. When using them
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:76px; top:4489px; width:219px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Consequently, putting a model trained during cross-validation straight into
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:4491px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">for text classification, the author therefore treats the samples of the different classes by “divide and
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:4505px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">use amounts to deploying an untested model. In both
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:4521px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">industry and academia this is unacceptable, because there is no way to assess
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:93px; top:4536px; width:189px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">the model's generalization ability or to judge whether it is overfitting.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:4548px; width:232px; height:45px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">This paper therefore splits the data set </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">7:3 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">into a training set and a test set,
<br>trains the Bayes classifier, logistic regression, and </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">SVC </span><span style="font-family: ASLVIP+SimSun; font-size:10px">models on the training set,
<br>and computes each model's </span><span style="font-family: TWCQZW+CMMI10; font-size:9px">F</span><span style="font-family: BSCXIL+CMR10; font-size:9px">1 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">score on the training and test sets. The results are shown in
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:161px; top:4595px; width:53px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Table </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
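<br>The 7:3 split and the F1 computation can be illustrated with a small sketch. The text does not say which averaging of F1 is used over the classes, so the macro average below is an assumption, and the helper names train_test_split_7_3 and macro_f1 are hypothetical.

```python
import random

def train_test_split_7_3(samples, seed=0):
    """Shuffle a copy of the samples and split it 7:3 into training and test parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

train, test = train_test_split_7_3(list(range(10)))
print(len(train), len(test))  # 7 3

# Toy labels and predictions (hypothetical).
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.7333
```

A training score of 1 next to a noticeably lower test score, as for logistic regression and SVC in the table, is the kind of gap this comparison is meant to expose.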
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:87px; top:4617px; width:197px; height:14px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Table </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">4. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">Goodness of fit (F1) of the models on the training and test sets
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:113px; top:4636px; width:12px; height:15px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">F1
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:113px; top:4663px; width:32px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Training set
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:113px; top:4681px; width:32px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Test set
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:158px; top:4633px; width:32px; height:58px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Bayes
<br>classifier
<br></span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">0.93
<br>0.88
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:202px; top:4633px; width:21px; height:58px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Logistic
<br>regression
<br></span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">1
<br>0.87
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:236px; top:4636px; width:21px; height:56px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">SVC
<br>1
<br>0.86
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:101px; top:4712px; width:173px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Readers can load the attached files </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">nb_model.pkl</span><span style="font-family: ASLVIP+SimSun; font-size:10px">,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:82px; top:4727px; width:210px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">lg_model.pkl </span><span style="font-family: ASLVIP+SimSun; font-size:10px">and </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">svc_model.pkl </span><span style="font-family: ASLVIP+SimSun; font-size:10px">to use these models.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:106px; top:4751px; width:158px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">2.4 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Low-level Implementation and Operation of the Algorithms
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:4772px; width:228px; height:44px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The author's computer configuration is: </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Intel(R) Core(TM) i5-5200U
<br>CPU 2.20GHz</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">8GB </span><span style="font-family: ASLVIP+SimSun; font-size:10px">of memory, </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Win7</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. During programming,
<br>the hardware limits led to many unavoidable problems. In
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:4822px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">addition, the models introduced in the previous subsection and the data preprocessing hide
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:4838px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">many problems, each with a corresponding solution. These low-level
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:158px; top:4853px; width:59px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">implementation details are described one by one below.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:129px; top:4875px; width:113px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">2.4.1 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Storing and Accessing Sparse Matrices
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:70px; top:4897px; width:230px; height:45px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">In Section </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, processing the data with the bag-of-words model yields a
<br></span><span style="font-family: BSCXIL+CMR10; font-size:9px">9210 </span><span style="font-family: HFJSQP+CMSY10; font-size:17px">× </span><span style="font-family: BSCXIL+CMR10; font-size:9px">396287 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">matrix. The vast majority of its entries are </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">0</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, and
<br></span><span style="font-family: ASLVIP+SimSun; font-size:10px">generating it directly would need about </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">27GB </span><span style="font-family: ASLVIP+SimSun; font-size:10px">of memory. Therefore, this
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:4507px; width:232px; height:87px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">conquer”, thereby turning the multi-class task into several binary classification tasks.
<br>The </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">OvR</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> divide-and-conquer strategy is used here: when separating out one class,
<br></span><span style="font-family: ASLVIP+SimSun; font-size:10px">all samples that do not belong to that class are treated as negative samples, which gives a binary
<br>classification problem. Compared with </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">OvO</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">10 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">, this has lower algorithmic
<br>complexity</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">11 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">. However, it makes the numbers of positive and negative samples unbalanced,
<br>which affects the model's goodness of fit. As a simple example, if the ratio of positive to
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:313px; top:4600px; width:225px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">negative samples is </span><span style="font-family: BSCXIL+CMR10; font-size:9px">1 : 99</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, a model that always answers “negative” reaches
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:4612px; width:229px; height:60px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">an accuracy of </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">99%</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, which is clearly not what anyone wants.
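<br>The OvR relabeling and the 1 : 99 example can be reproduced in a few lines. The class names and the number of categories are made up for illustration, and the helper names are hypothetical.

```python
def one_vs_rest_labels(labels, positive_class):
    """Relabel a multi-class problem as binary: 1 for positive_class, 0 for the rest."""
    return [1 if y == positive_class else 0 for y in labels]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Share of the positive samples that the model actually finds."""
    found = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(found) / len(found)

# Hypothetical label set: 1 sample of class "traffic" among 100 (a 1 : 99 ratio).
labels = ["traffic"] + ["other"] * 99
y_true = one_vs_rest_labels(labels, "traffic")
always_negative = [0] * len(y_true)

print(accuracy(y_true, always_negative))  # 0.99
print(recall(y_true, always_negative))    # 0.0

# Number of binary classifiers to train: n for OvR, n(n-1)/2 for OvO.
n_classes = 7  # hypothetical number of categories
print(n_classes, n_classes * (n_classes - 1) // 2)  # 7 21
```

The 99% accuracy hides a recall of zero on the class of interest, which is why the imbalance has to be corrected before training.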
<br>To address this class imbalance, the author uses the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">Borderline-
<br></span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">SMOTE</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> oversampling method. It creates new samples by interpolating
<br>between minority-class samples, and samples near the class boundary produce
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:4678px; width:229px; height:26px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">more of the new samples, which lowers the overfitting risk caused by simply duplicating data.
<br>The detailed algorithm is given in reference </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">[</span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">?</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">]</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> and is not
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:413px; top:4709px; width:29px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">repeated here.
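<br>The interpolation step can be sketched as below. This shows plain SMOTE interpolation only: the borderline variant additionally restricts the base points to minority samples near the class boundary, and that selection is omitted here. The function names and toy points are hypothetical.

```python
import random

def nearest_neighbors(point, others, k):
    """The k points in `others` closest to `point` (squared Euclidean distance)."""
    def dist(q):
        return sum((a - b) ** 2 for a, b in zip(point, q))
    return sorted(others, key=dist)[:k]

def smote_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbor = rng.choice(nearest_neighbors(base, others, k))
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Toy minority-class points (hypothetical).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
for point in smote_samples(minority, n_new=3):
    print(point)
```

Each synthetic point lies on the segment between a minority sample and one of its minority neighbors, so it is new data rather than a copy of an existing row.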
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:358px; top:4732px; width:134px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">2.4.3 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Grid Search in Practice
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:4758px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">As mentioned earlier, grid search lets human judgment be brought in, which avoids
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:4774px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">traversing the parameters blindly. When screening parameters, the author first uses a </span><span style="font-family: VDCSKW+SimHei; font-size:10px">large
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:4789px; width:229px; height:10px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">step</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> to sweep a wide parameter range. Then, taking the left and right neighbors of the best result as new bounds,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:4805px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">the step is gradually reduced and the range narrowed, so that the model parameters
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:408px; top:4821px; width:39px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">are selected more precisely.
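<br>The coarse-to-fine procedure can be sketched for a single parameter as follows. The scoring function, the search range, and the name coarse_to_fine_search are hypothetical stand-ins: in practice the score would be a cross-validated goodness of fit, and the manual inspection described in the text is replaced here by a fixed rule.

```python
def coarse_to_fine_search(score, low, high, points=5, rounds=4):
    """Maximize score(x) on [low, high] by repeatedly refining an evenly spaced grid."""
    best = low
    for _ in range(rounds):
        step = (high - low) / (points - 1)
        grid = [low + i * step for i in range(points)]
        best = max(grid, key=score)
        # The grid neighbors of the best point become the next, narrower range.
        low, high = max(low, best - step), min(high, best + step)
    return best

def toy_score(c):
    """Hypothetical validation score with its maximum at C = 0.3."""
    return -(c - 0.3) ** 2

best_c = coarse_to_fine_search(toy_score, low=0.0, high=10.0)
print(round(best_c, 2))
```

Four rounds of five points cost 20 evaluations, whereas a single grid with the same final resolution over the whole range would need far more.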
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:347px; top:4844px; width:156px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">2.4.4 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Some Settings of the Bayes Classifier
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:4870px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Depending on whether the features are continuous, Bayes classifiers come in three variants: </span><span style="font-family: VDCSKW+SimHei; font-size:10px">multinomial,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:4885px; width:229px; height:10px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Bernoulli, and Gaussian (normal)</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. The Bernoulli variant
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:4901px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">is generally used with binary features, so it is not adopted here. The bag-of-words
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:316px; top:4917px; width:219px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">representation of the data consists of frequency counts, which are naturally discrete features.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:313px; top:4932px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The Bayes classifier used in this paper is therefore of the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">multinomial</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> type.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:79px; top:4947px; width:321px; height:35px;"><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">9</span><span style="font-family: ASLVIP+SimSun; font-size:9px">Because </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">8GB </span><span style="font-family: ASLVIP+SimSun; font-size:9px">is far from enough, the computer would have to swap the data out to the hard disk
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">10</span><span style="font-family: ASLVIP+SimSun; font-size:9px">Short for </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">one vs one</span><span style="font-family: ASLVIP+SimSun; font-size:9px">, another divide-and-conquer strategy. Likewise, </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">OvR </span><span style="font-family: ASLVIP+SimSun; font-size:9px">stands for </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">one vs rest
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">11</span><span style="font-family: ASLVIP+SimSun; font-size:9px">It is not hard to show that the complexity of </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">OvR </span><span style="font-family: ASLVIP+SimSun; font-size:9px">is </span><span style="font-family: HZJVKZ+CMMI9; font-size:8px">O</span><span style="font-family: GFFVFL+CMR9; font-size:8px">(</span><span style="font-family: HZJVKZ+CMMI9; font-size:8px">n</span><span style="font-family: GFFVFL+CMR9; font-size:8px">)</span><span style="font-family: ASLVIP+SimSun; font-size:9px">, while that of </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">OvO </span><span style="font-family: ASLVIP+SimSun; font-size:9px">is </span><span style="font-family: HZJVKZ+CMMI9; font-size:8px">O</span><span style="font-family: GFFVFL+CMR9; font-size:8px">(</span><span style="font-family: HZJVKZ+CMMI9; font-size:8px">n</span><span style="font-family: UQLGDK+CMR6; font-size:5px">2</span><span style="font-family: GFFVFL+CMR9; font-size:8px">)</span><span style="font-family: ASLVIP+SimSun; font-size:9px">, where n is the number of classes
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:303px; top:4997px; width:5px; height:15px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">6
<br></span></div><span style="position:absolute; border: black 1px solid; left:107px; top:4633px; width:156px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:107px; top:4660px; width:156px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:107px; top:4678px; width:156px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:107px; top:4696px; width:156px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:70px; top:4948px; width:92px; height:0px;"></span>
<span style="position:absolute; border: gray 1px solid; left:0px; top:5102px; width:612px; height:792px;"></span>
<div style="position:absolute; top:5102px;"><a name="7">Page 7</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:5175px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Of course, such “natural number” discrete features can also be treated as
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:5175px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Therefore, to reduce the number of features, this paper uses a Chinese word segmenter
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:119px; top:5187px; width:137px; height:13px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">continuous, in which case the Gaussian variant is used</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">12</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:5190px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">to split sentences into individual words. In addition, the forward maximum matching
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:5207px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Because the multinomial Bayes classifier is used here,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:316px; top:5206px; width:219px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">algorithm is used, and stop words and common words are filtered out. 并采用类似的方
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:5222px; width:229px; height:26px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">the model needs to be smoothed to improve its generalization.
<br>This paper uses a </span><span style="font-family: BSCXIL+CMR10; font-size:9px">+1 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">smoothing strategy</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">13</span><span style="font-family: ASLVIP+SimSun; font-size:10px">; for the implementation,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:121px; top:5250px; width:133px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">see reference </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">[</span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">?</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">]</span><span style="font-family: ASLVIP+SimSun; font-size:10px">; it is not repeated here.
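<br>A small sketch of a multinomial Bayes classifier with +1 (add-one) smoothing on count vectors. The toy data, class names, and function names are hypothetical, and this is not the paper's implementation, which is described in the cited reference.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, vocab_size):
    """Estimate log priors and add-one smoothed log likelihoods from count vectors."""
    class_docs = Counter(labels)
    feature_counts = defaultdict(lambda: [0] * vocab_size)
    for counts, label in zip(docs, labels):
        for j, c in enumerate(counts):
            feature_counts[label][j] += c
    model = {}
    for label, totals in feature_counts.items():
        denom = sum(totals) + vocab_size  # +1 for every vocabulary entry
        log_like = [math.log((t + 1) / denom) for t in totals]
        model[label] = (math.log(class_docs[label] / len(labels)), log_like)
    return model

def predict(model, counts):
    """Return the class with the highest posterior log score."""
    def log_score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(c * ll for c, ll in zip(counts, log_like))
    return max(model, key=log_score)

# Toy count vectors over a 3-term vocabulary (hypothetical data).
docs = [[2, 1, 0], [3, 0, 0], [0, 1, 2], [0, 0, 3]]
labels = ["road", "road", "school", "school"]
model = train_multinomial_nb(docs, labels, vocab_size=3)
print(predict(model, [1, 0, 0]))  # road
print(predict(model, [0, 0, 4]))  # school
```

Without the +1 term, a term never seen in a class would get probability zero and its logarithm would be undefined; smoothing keeps every likelihood strictly positive.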
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:134px; top:5280px; width:102px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">2.4.5 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">参数寻优算法
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:74px; top:5304px; width:222px; height:29px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Apart from </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">kNN</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> and the Bayesian classifier, every other model has to find the parameters that minimize some </span><span style="font-family: VDCSKW+SimHei; font-size:10px">cost function</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. In other words, training a model (solving for its parameters) is an optimization problem. In this paper the author solves it with the </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">LBFGS</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> optimization algorithm. </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">LBFGS </span><span style="font-family: ASLVIP+SimSun; font-size:10px">is a limited-memory quasi-Newton method; the variant used here is stochastic and works on a </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">mini-batch </span><span style="font-family: ASLVIP+SimSun; font-size:10px">to cut the amount of computation. Compared with the full quasi-Newton method it saves memory, and working on a </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">mini-batch </span><span style="font-family: ASLVIP+SimSun; font-size:10px">reduces the time spent on approximating the Hessian matrix</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">14</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. The </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">LBFGS </span><span style="font-family: ASLVIP+SimSun; font-size:10px">used in this paper has a step size of </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">0.01</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> and a </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">mini-batch </span><span style="font-family: ASLVIP+SimSun; font-size:10px">of </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">100 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">samples. For the details of the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">LBFGS </span><span style="font-family: ASLVIP+SimSun; font-size:10px">algorithm, see reference </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">[</span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">?</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">]</span><span style="font-family: ASLVIP+SimSun; font-size:10px">; they are not repeated here.
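<br>To make the idea of training as cost minimization concrete, the following is a minimal sketch of the limited-memory BFGS two-loop recursion in plain Python. It is not the implementation used in this paper: it is deterministic (full batch) and uses a backtracking line search instead of a fixed step of 0.01 on mini-batches of 100 samples, and the names lbfgs, dot and least_squares as well as the toy data are assumptions made for illustration only.

```python
import math


def dot(a, b):
    """Inner product of two equally long sequences."""
    return sum(x * y for x, y in zip(a, b))


def lbfgs(cost_grad, w0, memory=5, max_iter=100, tol=1e-8):
    """Minimise a cost with L-BFGS (two-loop recursion, backtracking line search).

    cost_grad(w) must return the pair (cost, gradient) at the point w.
    """
    w = list(w0)
    cost, grad = cost_grad(w)
    history = []  # most recent (s, y, rho) triples, oldest first
    for _ in range(max_iter):
        if tol >= math.sqrt(dot(grad, grad)):
            break
        # Two-loop recursion: apply the implicit inverse-Hessian estimate to grad.
        q = list(grad)
        alphas = []
        for s, y, rho in reversed(history):
            a = rho * dot(s, q)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if history:
            s, y, _ = history[-1]
            gamma = dot(s, y) / dot(y, y)  # initial scaling of the estimate
            q = [gamma * qi for qi in q]
        for (s, y, rho), a in zip(history, reversed(alphas)):
            b = rho * dot(y, q)
            q = [qi + (a - b) * si for qi, si in zip(q, s)]
        direction = [-qi for qi in q]
        # Backtracking line search with the Armijo sufficient-decrease test.
        step, slope = 1.0, dot(grad, direction)
        while True:
            w_new = [wi + step * di for wi, di in zip(w, direction)]
            cost_new, grad_new = cost_grad(w_new)
            if cost + 1e-4 * step * slope >= cost_new or 1e-12 >= step:
                break
            step *= 0.5
        s = [a - b for a, b in zip(w_new, w)]
        y = [a - b for a, b in zip(grad_new, grad)]
        sy = dot(s, y)
        if sy > 1e-12:  # keep only pairs that satisfy the curvature condition
            history.append((s, y, 1.0 / sy))
            if len(history) > memory:
                history.pop(0)  # limited memory: drop the oldest pair
        w, cost, grad = w_new, cost_new, grad_new
    return w, cost


def least_squares(w, xs=(0.0, 1.0, 2.0, 3.0), ys=(1.0, 3.0, 5.0, 7.0)):
    """Toy cost: 0.5 * sum of squared residuals of the line w[0] + w[1] * x."""
    residuals = [w[0] + w[1] * x - y for x, y in zip(xs, ys)]
    cost = 0.5 * sum(r * r for r in residuals)
    grad = [sum(residuals), sum(r * x for r, x in zip(residuals, xs))]
    return cost, grad


weights, final_cost = lbfgs(least_squares, [0.0, 0.0])
print(weights, final_cost)
```

The toy data lie exactly on the line y = 1 + 2x, so the fitted weights should approach (1, 2). Only the last few (s, y) pairs are stored, which is where the memory saving over a full quasi-Newton method comes from.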
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:122px; top:5486px; width:126px; height:17px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:17px">3 </span><span style="font-family: VDCSKW+SimHei; font-size:12px">Text Clustering and the Popularity Algorithm
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:5514px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Collecting, from the messages of individual citizens, the problems that citizens raise most often within a given period is clearly a text clustering problem. If similar messages are grouped into one cluster, the cluster can be treated as one such common problem. The popularity of the problem can then be estimated from the number of messages in the cluster and its total numbers of upvotes and downvotes, taking into account how popularity decays over time.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:5608px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">To extract the problem description, the place and the group of people from the messages in a cluster, this paper uses a key-sentence extraction algorithm to generate key sentences automatically; the place and the group of people are then extracted manually from those key sentences. This reduces the workload of manually extracting a problem summary directly from the message details.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:150px; top:5695px; width:70px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">3.1 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Feature Engineering
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:70px; top:5717px; width:230px; height:29px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Unlike Section </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, text clustering is an unsupervised problem, so features cannot be selected with the chi-square test. Therefore, if the bigram model were still used
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:5221px; width:232px; height:41px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">method, which merges place names, person names, organization names and the like on top of the coarse segmentation. After this processing, the corpus is converted into a sparse matrix with the bag-of-words model, in the same way as in Section </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. To reduce the feature dimension, the paper also applies principal component analysis to compress the data. As before, the corpus here always refers to the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">message details</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:364px; top:5306px; width:124px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">3.1.1 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Conditional Random Field Segmenter
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:5332px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">As described above, splitting a message into individual words first requires a word segmenter trained with machine learning. Training the segmenter in turn requires a corpus that has been </span><span style="font-family: VDCSKW+SimHei; font-size:10px">segmented beforehand</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> to serve as the training set.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:5394px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">One way to obtain such a corpus is to segment Attachment 2 by hand, but the cost is far too high to be worthwhile. This paper therefore uses open corpora, such as the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">PKU </span><span style="font-family: ASLVIP+SimSun; font-size:10px">and </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">MSR </span><span style="font-family: ASLVIP+SimSun; font-size:10px">corpora provided by </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">SIGHAN05</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">15</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">MSR </span><span style="font-family: ASLVIP+SimSun; font-size:10px">is annotated more consistently than </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">PKU</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, as past evaluation reports corroborate. In addition, </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">MSR </span><span style="font-family: ASLVIP+SimSun; font-size:10px">segments at a coarser granularity and leaves some place names unsplit, which suits the present setting. The </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">MSR </span><span style="font-family: ASLVIP+SimSun; font-size:10px">corpus is divided into a training corpus and a test corpus; an excerpt is shown below:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:310px; top:5518px; width:230px; height:10px;"><span style="font-family: AGCYGV+FangSong; font-size:10px">“ 人们 常 说 生活 是 一 部 教科书 ，
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:324px; top:5534px; width:203px; height:10px;"><span style="font-family: AGCYGV+FangSong; font-size:10px">而 血 与 火 的 战争 更 是 不可多
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:324px; top:5549px; width:202px; height:10px;"><span style="font-family: AGCYGV+FangSong; font-size:10px">得 的 教科书 ， 她 确实 是 名副其
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:344px; top:5565px; width:162px; height:10px;"><span style="font-family: AGCYGV+FangSong; font-size:10px">实 的 ‘ 我 的 大学 ’ 。
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:339px; top:5581px; width:173px; height:10px;"><span style="font-family: AGCYGV+FangSong; font-size:10px">“ 心 静 渐 知 春 似 海 ，
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:349px; top:5596px; width:152px; height:10px;"><span style="font-family: AGCYGV+FangSong; font-size:10px">花 深 每 觉 影 生 香 。
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:365px; top:5612px; width:122px; height:10px;"><span style="font-family: AGCYGV+FangSong; font-size:10px">“ 吃 屎 的 东西 ，
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:321px; top:5625px; width:209px; height:13px;"><span style="font-family: AGCYGV+FangSong; font-size:10px">连 一 捆 麦 也 铡 不 动 呀 ？</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">...
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:314px; top:5643px; width:229px; height:41px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Because Chinese word segmentation is in fact a sequence labelling problem, define the tag set as $\{B, M, E, S\}$, where $B$, $M$, $E$ and $S$ denote the beginning of a word, the middle of a word, the end of a word and a single-character word respectively. The example sentence </span><span style="font-family: AGCYGV+FangSong; font-size:10px">我爱第八届泰迪杯挑战赛</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> is then split as:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:346px; top:5703px; width:159px; height:13px;"><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">[</span><span style="font-family: AGCYGV+FangSong; font-size:10px">我</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">/S, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">爱</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">/S, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">第</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">/B</span><span style="font-family: AGCYGV+FangSong; font-size:10px">八</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">/M</span><span style="font-family: AGCYGV+FangSong; font-size:10px">届</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">/E, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">泰</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">/B
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:356px; top:5718px; width:138px; height:13px;"><span style="font-family: AGCYGV+FangSong; font-size:10px">迪</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">/E, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">杯</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">/S, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">挑</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">/B</span><span style="font-family: AGCYGV+FangSong; font-size:10px">战</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">/E, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">赛</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">/S]
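<br></span><span style="font-family: ASLVIP+SimSun; font-size:10px">The tagging scheme can be sketched in a few lines of Python. This is only an illustration of how a pre-segmented corpus line maps to B/M/E/S labels and back; the function names words_to_bmes and bmes_to_words are assumptions, not part of the original work.

```python
def words_to_bmes(words):
    """Turn a list of segmented words into (character, tag) pairs."""
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "S"))            # single-character word
        else:
            tagged.append((word[0], "B"))         # beginning of the word
            tagged.extend((ch, "M") for ch in word[1:-1])  # middle characters
            tagged.append((word[-1], "E"))        # end of the word
    return tagged


def bmes_to_words(tagged):
    """Rebuild the words from (character, tag) pairs."""
    words, current = [], ""
    for ch, tag in tagged:
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:                # tolerate a dangling B or M at the end
        words.append(current)
    return words


example = ["我", "爱", "第八届", "泰迪", "杯", "挑战", "赛"]
tagged = words_to_bmes(example)
print(" ".join(ch + "/" + tag for ch, tag in tagged))
```

Running words_to_bmes over every line of a segmented corpus such as MSR yields the character-level training labels, and bmes_to_words converts predicted labels back into words.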
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:5737px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Each Chinese character therefore has a corresponding state.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:93px; top:5752px; width:189px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">to model the corpus, the number of features would become very large.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:79px; top:5752px; width:463px; height:27px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">This is clearly a classification problem, so it too can be solved with machine learning.
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">12</span><span style="font-family: ASLVIP+SimSun; font-size:9px">Readers are advised against this: in the author's research and practice, the normal-distribution variant tends to score low on accuracy and similar metrics, whether or not the features are continuous. The author attributes this to the difficulty of training the parameters of the normal distribution.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:79px; top:5789px; width:196px; height:34px;"><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">13</span><span style="font-family: ASLVIP+SimSun; font-size:9px">That is, the smoothing strategy with a Laplace correction coefficient of </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">1</span><span style="font-family: ASLVIP+SimSun; font-size:9px">.
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">14</span><span style="font-family: ASLVIP+SimSun; font-size:9px">Convergence slows down accordingly, but the method still converges.
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">15</span><span style="font-family: ASLVIP+SimSun; font-size:9px">The Second International Chinese Word Segmentation Bakeoff; freely available for research purposes.
<br></span></div><span style="position:absolute; border: black 1px solid; left:70px; top:5768px; width:92px; height:0px;"></span>
<span style="position:absolute; border: gray 1px solid; left:0px; top:5944px; width:612px; height:792px;"></span>
<div style="position:absolute; top:5944px;"><a name="8">Page 8</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:76px; top:6017px; width:219px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The state of a Chinese character depends on the state of the character before it. A structured discriminative model, the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">conditional random field</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, is therefore used here. The author holds that machine learning is an art of practice: the conditional random field was adopted only after several candidate models had been compared and it was found to perform best. See Appendix </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">B</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> for details.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:70px; top:6114px; width:230px; height:45px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">A conditional random field resembles a hidden Markov model, as shown in Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. The feature $\mathbf{x}_t$ consists of $n$ consecutive Chinese characters $x_i,\ i \in \{1, 2, \cdots, n\}$, with $n = 5$ here. Each square can be understood as a feature function $f_k(y_{t-1}, y_t, \mathbf{x}_t)$, and $\mathbf{y}_t = (y_1, y_2, \cdots, y_5),\ y \in \{B, M, E, S\}$ is the label vector.
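<br>The dependence of each label on the previous one can be illustrated with a small decoding sketch. It assumes that per-character scores are already available (in a trained model they would come from the weighted feature functions), and it treats the label transitions purely as hard legality constraints, which is a simplification rather than the model trained in this paper. The function name viterbi_bmes and all scores below are hypothetical.

```python
TAGS = "BMES"
# Which tags may legally follow each tag in the B/M/E/S scheme.
ALLOWED_NEXT = {"B": "ME", "M": "ME", "E": "BS", "S": "BS"}


def viterbi_bmes(scores):
    """Return the highest-scoring legal tag sequence.

    scores: one dict per character mapping a tag to its score
            (tags missing from a dict score 0.0).
    """
    neg_inf = float("-inf")
    # A sentence can only start with B or S.
    best = {t: (scores[0].get(t, 0.0) if t in "BS" else neg_inf) for t in TAGS}
    back_pointers = []
    for char_scores in scores[1:]:
        new_best, pointers = {}, {}
        for cur in TAGS:
            # Best previous tag among those allowed to precede `cur`.
            prev_score, prev_tag = max(
                (best[prev], prev) for prev in TAGS if cur in ALLOWED_NEXT[prev]
            )
            new_best[cur] = prev_score + char_scores.get(cur, 0.0)
            pointers[cur] = prev_tag
        best = new_best
        back_pointers.append(pointers)
    # A sentence can only end with E or S.
    last = max("ES", key=lambda t: best[t])
    path = [last]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for the four characters 我 爱 泰 迪.
toy_scores = [{"S": 2.0}, {"S": 2.0}, {"B": 2.0}, {"E": 2.0}]
print(viterbi_bmes(toy_scores))
```

Because a B can only be followed by M or E, and an E only by B or S, the decoder never outputs an ill-formed sequence such as B followed by S, which an independent per-character classifier could produce.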
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:137px; top:6347px; width:97px; height:14px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">4. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">Principle of the conditional random field
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:342px; top:6012px; width:167px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">3.1.2 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Stop-Word Filtering and Named-Entity Merging
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:6039px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">In Chinese, some words, such as </span><span style="font-family: AGCYGV+FangSong; font-size:10px">的, 啊, 呢, 换句话说 (in other words) and 总而言之 (in short)</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, contribute little to the information carried by a sentence. Likewise, </span><span style="font-family: VDCSKW+SimHei; font-size:10px">punctuation marks</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> and </span><span style="font-family: VDCSKW+SimHei; font-size:10px">special symbols</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> such as tabs do not affect the meaning. These </span><span style="font-family: VDCSKW+SimHei; font-size:10px">stop words</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> therefore need to be filtered out after the coarse segmentation. In addition, some person names, place names and organization names, as well as numbers, are split apart by the coarse segmentation and need to be merged again.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:6132px; width:231px; height:56px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">All of this can be done with the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">forward maximum matching</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> algorithm. The algorithm needs a dictionary; for stop-word filtering, the open-source </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">HanLP </span><span style="font-family: ASLVIP+SimSun; font-size:10px">dictionary</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">17</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> is used here. Forward maximum matching starts at some character and scans the following characters from front to back. If the string built so far is in the dictionary but its combination with the next character is not (that is, it is the longest match), the string is filtered out.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:314px; top:6226px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Person names, place names and organization names are handled in the same way: with a suitable dictionary, </span><span style="font-family: VDCSKW+SimHei; font-size:10px">forward maximum matching</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> merges them again on top of the coarse segmentation.
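<br>A minimal sketch of dictionary-based forward maximum matching over coarse tokens follows. The tiny stop-word and entity dictionaries are made up for illustration and are not the HanLP dictionaries; the helper names longest_match_end, filter_stop_words and merge_entities and the window size max_units are likewise assumptions.

```python
def longest_match_end(tokens, start, lexicon, max_units=4):
    """End index of the longest lexicon entry that begins at `start`.

    Candidates are formed by joining consecutive tokens, longest first.
    Returns `start` itself when nothing in the lexicon matches.
    """
    for end in range(min(len(tokens), start + max_units), start, -1):
        if "".join(tokens[start:end]) in lexicon:
            return end
    return start


def filter_stop_words(tokens, stop_words):
    """Drop every longest match found in the stop-word dictionary."""
    kept, i = [], 0
    while len(tokens) > i:
        end = longest_match_end(tokens, i, stop_words)
        if end == i:           # no stop word starts here: keep the token
            kept.append(tokens[i])
            i += 1
        else:                  # skip the whole matched stop word
            i = end
    return kept


def merge_entities(tokens, entities):
    """Join consecutive tokens that together form a dictionary entity."""
    merged, i = [], 0
    while len(tokens) > i:
        end = max(longest_match_end(tokens, i, entities), i + 1)
        merged.append("".join(tokens[i:end]))
        i = end
    return merged


# Hypothetical coarse segmentation and toy dictionaries.
coarse = ["换句话", "说", "A", "市", "的", "门牌", "10", "年", "未曾", "更换", "啊"]
stop_words = {"的", "啊", "换句话说"}
entities = {"A市", "10年"}

result = merge_entities(filter_stop_words(coarse, stop_words), entities)
print(result)
```

Trying the longest candidate first is what makes the match maximal: a shorter dictionary entry is accepted only when no longer entry starting at the same position exists.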
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:6258px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">After the processing described above (conditional random field segmentation, stop-word filtering and so on), the segmentation result for the second message in Attachment 3, taken as an example, is shown below. The token “</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">10</span><span style="font-family: AGCYGV+FangSong; font-size:10px">年</span><span style="font-family: ASLVIP+SimSun; font-size:10px">” is the result of merging after the coarse segmentation.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:310px; top:6302px; width:230px; height:13px;"><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">[A</span><span style="font-family: AGCYGV+FangSong; font-size:10px">市</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, A, 6</span><span style="font-family: AGCYGV+FangSong; font-size:10px">区</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">道路</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">命名</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">规划</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">已经</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">初步</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">成
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:319px; top:6318px; width:212px; height:13px;"><span style="font-family: AGCYGV+FangSong; font-size:10px">果</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">公示</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">文件</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">转化</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">成为</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">正式</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">成果</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">希
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:310px; top:6333px; width:230px; height:13px;"><span style="font-family: AGCYGV+FangSong; font-size:10px">望</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">加快</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">完成</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">路名</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">规范</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">道路</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">安装</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">路</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">名
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:319px; top:6349px; width:213px; height:13px;"><span style="font-family: AGCYGV+FangSong; font-size:10px">牌</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">变更</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">路</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">名牌</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">及时</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">更换</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, A, 6</span><span style="font-family: AGCYGV+FangSong; font-size:10px">区</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">农
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:310px; top:6365px; width:230px; height:44px;"><span style="font-family: AGCYGV+FangSong; font-size:10px">村</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">门牌</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, 10</span><span style="font-family: AGCYGV+FangSong; font-size:10px">年</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">未曾</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">更换</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">会</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">统一</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">更换</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">现
<br>在</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">找</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">地方</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">只能</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">说</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">路口</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">没有</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">充分</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">发
<br>挥</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">路名</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">地名</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">作用</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, A, 6</span><span style="font-family: AGCYGV+FangSong; font-size:10px">区</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">行政区划</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">已
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:312px; top:6411px; width:227px; height:13px;"><span style="font-family: AGCYGV+FangSong; font-size:10px">经</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">调整</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">完毕</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">门牌</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">更新</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">应该</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">同步</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">开展</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">]
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:119px; top:6405px; width:139px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The conditional random field is then defined as:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:79px; top:6430px; width:221px; height:20px;"><span style="font-family: TWCQZW+CMMI10; font-size:9px">\[ p(\boldsymbol{y} \mid \boldsymbol{x}) = \frac{1}{Z(\boldsymbol{x})} \prod_{t=1}^{T} \exp\Bigl\{ \sum_{k=1}^{K} \boldsymbol{w}_k f_k(y_{t-1}, y_t, \boldsymbol{x}_t) \Bigr\} \tag{3} \]
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:352px; top:6438px; width:147px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">3.1.3 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Bag-of-Words Model and </span><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">PCA </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Dimensionality Reduction
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:6474px; width:227px; height:11px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">where \(\boldsymbol{w}_k\) are the parameters to be trained and \(Z(\boldsymbol{x})\) is the normalization factor, given by:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:93px; top:6514px; width:184px; height:20px;"><span style="font-family: TWCQZW+CMMI10; font-size:9px">\[ Z(\boldsymbol{x}) = \sum_{\boldsymbol{y}} \prod_{t=1}^{T} \exp\Bigl\{ \sum_{k=1}^{K} \boldsymbol{w}_k f_k(y_{t-1}, y_t, \boldsymbol{x}_t) \Bigr\} \]
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:6558px; width:229px; height:41px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Training a conditional random field is a lengthy procedure and, for reasons of space, is not described in detail here; see reference </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">[</span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">?</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">]</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. Once the model has been trained</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">16</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> with the Viterbi algorithm </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">[</span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">?</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">]</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, it can be used to label the corpus as a sequence, and the text is then segmented into words according to the resulting labels.
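<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:6617px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The last step, cutting the text into words from the predicted labels, can be sketched as follows. The sketch assumes the common four-tag scheme B/M/E/S (begin, middle, end, single); this section does not state which tag set the model uses, and the function name tags_to_words is purely illustrative.
<br></span><pre>
```python
def tags_to_words(chars, tags):
    """Group characters into words from per-character B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        # "E" closes a multi-character word; "S" is a one-character word.
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


# Assumed tagging of a phrase taken from the sample message.
chars = list("门牌更新应该同步开展")
tags = ["B", "E", "B", "E", "B", "E", "B", "E", "B", "E"]
print(tags_to_words(chars, tags))
```
</pre><span style="font-family: ASLVIP+SimSun; font-size:10px">With these tags the call prints ['门牌', '更新', '应该', '同步', '开展'].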
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:6465px; width:230px; height:150px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">To convert the segmented documents into structured feature vectors, this paper again uses the bag-of-words model described in Section </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, which turns the corpus into a \(4326 \times 42754\) sparse matrix. As before, to save memory, the sparse data are stored and accessed with the method of Section </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
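<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:6527px; width:230px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">As a rough illustration of why sparse storage saves memory, the sketch below builds the bag-of-words count matrix directly in compressed sparse row (CSR) form using only the Python standard library. CSR is an assumption made for this example; the storage scheme actually used is the one from the earlier subsection, and bag_of_words_csr is a hypothetical helper.
<br></span><pre>
```python
def bag_of_words_csr(docs):
    """Return (vocab, data, indices, indptr) for tokenized documents.

    Row i of the count matrix occupies data[indptr[i]:indptr[i + 1]],
    with the column numbers in the matching slice of indices.
    """
    vocab = {}  # token to column index, in order of first appearance
    data, indices, indptr = [], [], [0]
    for tokens in docs:
        counts = {}
        for tok in tokens:
            col = vocab.setdefault(tok, len(vocab))
            counts[col] = counts.get(col, 0) + 1
        for col in sorted(counts):
            indices.append(col)
            data.append(counts[col])
        indptr.append(len(indices))
    return vocab, data, indices, indptr


# Three toy documents built from tokens of the sample message.
docs = [["门牌", "更换", "门牌"], ["道路", "路名", "规范"], ["路名", "更换"]]
vocab, data, indices, indptr = bag_of_words_csr(docs)
stored = len(data)
dense = len(docs) * len(vocab)
print(stored, "stored counts instead of", dense, "cells")
```
</pre><span style="font-family: ASLVIP+SimSun; font-size:10px">For the three toy documents only 7 counts are stored instead of 15 cells; the gap widens for a matrix with tens of thousands of columns in which most entries are zero.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:6527px; width:230px; height:88px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">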
<br>Note that there are many other ways to convert a corpus into feature vectors, </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">TF-IDF</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> among them. However, because </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">TF-IDF</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> is implemented on top of the bag-of-words model, adopting it would compromise the sparsity of the data and increase the load on memory and the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">CPU</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. In addition, </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">TF-IDF </span><span style="font-family: ASLVIP+SimSun; font-size:10px">assigns low values to words that occur frequently across the corpus, so important features such as place names are silently neglected. For a clustering
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:79px; top:6620px; width:461px; height:35px;"><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">16</span><span style="font-family: ASLVIP+SimSun; font-size:9px">The trained model is </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">156MB</span><span style="font-family: ASLVIP+SimSun; font-size:9px"> in size and has therefore not been uploaded.
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">17</span><span style="font-family: ASLVIP+SimSun; font-size:9px">See the file </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">stopwords.txt
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">18</span><span style="font-family: ASLVIP+SimSun; font-size:9px">In fact, </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">TF-IDF </span><span style="font-family: ASLVIP+SimSun; font-size:9px">works better for clustering with few topics, such as </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">BOW </span><span style="font-family: ASLVIP+SimSun; font-size:9px">clustering of images. For multi-topic clustering such as this, </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">TF-IDF </span><span style="font-family: ASLVIP+SimSun; font-size:9px">has little to recommend it. The author has verified this in earlier research.
<br></span></div><div style="position:absolute; border: figure 1px solid; writing-mode:False; left:79px; top:6234px; width:212px; height:117px;"></div><span style="position:absolute; border: black 1px solid; left:122px; top:6445px; width:21px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:70px; top:6621px; width:92px; height:0px;"></span>
<span style="position:absolute; border: gray 1px solid; left:0px; top:6786px; width:612px; height:792px;"></span>
<div style="position:absolute; top:6786px;"><a name="9">Page 9</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:114px; top:6855px; width:147px; height:13px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">problem, this is counterproductive</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">18</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
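<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:70px; top:6870px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The point about frequent words can be checked with a small calculation. The sketch below uses the plain, unsmoothed definition idf = log(N / df), which is an assumption (libraries often add smoothing), and three toy messages that all mention the same district token.
<br></span><pre>
```python
import math


def idf(docs):
    """Inverse document frequency log(N / df) for every token."""
    n = len(docs)
    df = {}
    for tokens in docs:
        for tok in set(tokens):  # count each token once per document
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log(n / d) for tok, d in df.items()}


# Toy messages: the place token occurs in all of them.
docs = [
    ["6区", "门牌", "更换"],
    ["6区", "道路", "规范"],
    ["6区", "路名", "变更"],
]
weights = idf(docs)
print(weights["6区"])   # 0.0, the shared place name loses all weight
print(weights["门牌"])  # log(3), roughly 1.10
```
</pre><span style="font-family: ASLVIP+SimSun; font-size:10px">The token shared by every message receives weight 0, whereas a token found in a single message receives log 3, so any TF-IDF product for the place name vanishes regardless of its term frequency.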
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:390px; top:6854px; width:70px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">3.2 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Text Clustering
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:70px; top:6874px; width:229px; height:136px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Because the converted data still has far too many features, principal component analysis (hereafter </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">PCA</span><span style="font-family: ASLVIP+SimSun; font-size:10px">) is applied to it. When the number of features far exceeds the number of samples, the samples are usually linearly separable, so the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">PCA </span><span style="font-family: ASLVIP+SimSun; font-size:10px">used here </span><span style="font-family: VDCSKW+SimHei; font-size:10px">does not use a kernel function</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. As Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> shows for a data set with two features, </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">PCA </span><span style="font-family: ASLVIP+SimSun; font-size:10px">first finds the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">eigenvalues and eigenvectors</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> of the data's covariance matrix. It then compares how widely the data are spread along the two directions and projects the data onto the direction with the wider spread, since at the same scale a wider spread carries more information. When reducing dimensionality with </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">PCA</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, the data are therefore usually projected onto \(e_1\).
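<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:70px; top:7012px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">For the two-feature case described above, the procedure can be sketched with the closed-form eigen-decomposition of the 2 x 2 covariance matrix. The function name pca_first_axis and the toy points are illustrative assumptions; a pipeline working on tens of thousands of features would rely on a numerical library instead.
<br></span><pre>
```python
import math


def pca_first_axis(points):
    """Return (e1, scores): top principal axis of 2-D points and projections."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(p[0] - mx, p[1] - my) for p in points]
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix in closed form.
    lam1 = (a + c) / 2 + math.hypot((a - c) / 2, b)
    # A matching eigenvector; fall back to a coordinate axis when b is zero.
    if b != 0:
        v = (b, lam1 - a)
    else:
        v = (1.0, 0.0) if a == max(a, c) else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    e1 = (v[0] / norm, v[1] / norm)
    # One-dimensional representation: coordinates along e1.
    scores = [x * e1[0] + y * e1[1] for x, y in centered]
    return e1, scores


points = [(0.0, 0.0), (2.0, 1.0), (4.0, 2.0), (6.0, 3.0)]
e1, scores = pca_first_axis(points)
print([round(v, 3) for v in e1])
```
</pre><span style="font-family: ASLVIP+SimSun; font-size:10px">Here every point lies on a line of slope 1/2, so e1 comes out as approximately (0.894, 0.447) and the one-dimensional scores retain all of the variance.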
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:6889px; width:234px; height:57px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Once a structured data set is available, a clustering method can be applied to cluster the texts. A common clustering algorithm is </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">K-means</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, but its number of clusters \(k\) must be chosen by hand, which is impractical here. This chapter therefore uses an adaptive clustering algorithm, </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">DBSCAN</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:368px; top:6987px; width:115px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">3.2.1 DBSCAN </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Principles
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:312px; top:7015px; width:232px; height:50px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Let the data set be \(X = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_m\}\), where \(m\) is the sample size. Let the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">DBSCAN </span><span style="font-family: ASLVIP+SimSun; font-size:10px">parameters \(\varepsilon\) and \(m'\) be preset thresholds, and define the following concepts:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:325px; top:7089px; width:221px; height:77px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">• </span><span style="font-family: VDCSKW+SimHei; font-size:10px">Neighborhood</span><span style="font-family: ASLVIP+SimSun; font-size:10px">: by analogy with the nearest-neighbor concept of the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">kNN </span><span style="font-family: ASLVIP+SimSun; font-size:10px">algorithm </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">[</span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">?</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">]</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, the \(\varepsilon\)-neighborhood of a sample \(\boldsymbol{x}_i\) is the set of samples whose Euclidean distance from \(\boldsymbol{x}_i\) does not exceed the preset threshold \(\varepsilon\): \(N(\boldsymbol{x}_i) = \{\boldsymbol{x}_j \mid d(\boldsymbol{x}_i, \boldsymbol{x}_j) \le \varepsilon\}\). The number of samples in the \(\varepsilon\)-neighborhood of \(\boldsymbol{x}_i\) is written \(|N(\boldsymbol{x}_i)|\).
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:325px; top:7190px; width:221px; height:35px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">• </span><span style="font-family: VDCSKW+SimHei; font-size:10px">Core sample</span><span style="font-family: ASLVIP+SimSun; font-size:10px">: if a sample \(\boldsymbol{x}_i\) satisfies \(|N(\boldsymbol{x}_i)| \ge m'\), then \(\boldsymbol{x}_i\) is called a core sample.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:325px; top:7255px; width:221px; height:91px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">• </span><span style="font-family: VDCSKW+SimHei; font-size:10px">Density-reachable</span><span style="font-family: ASLVIP+SimSun; font-size:10px">: if a sample \(\boldsymbol{x}_j\) lies in the \(\varepsilon\)-neighborhood of a core sample \(\boldsymbol{x}_i\), then \(\boldsymbol{x}_i\) is said to reach \(\boldsymbol{x}_j\). Density reachability is transitive: if \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) are both core samples, \(\boldsymbol{x}_i\) reaches \(\boldsymbol{x}_j\), and \(\boldsymbol{x}_j\) reaches \(\boldsymbol{x}_k\), then \(\boldsymbol{x}_i\) reaches \(\boldsymbol{x}_k\). Density reachability is not symmetric: \(\boldsymbol{x}_i\) reaching \(\boldsymbol{x}_j\) does not imply that \(\boldsymbol{x}_j\) reaches \(\boldsymbol{x}_i\), unless both are core samples.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:325px; top:7376px; width:216px; height:15px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">• </span><span style="font-family: VDCSKW+SimHei; font-size:10px">Density-connected</span><span style="font-family: ASLVIP+SimSun; font-size:10px">: if a core individual $\boldsymbol{x}_i$ reaches both $\boldsymbol{x}_j$ and $\boldsymbol{x}_k$, then
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:338px; top:7395px; width:79px; height:11px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">$\boldsymbol{x}_j$ and $\boldsymbol{x}_k$ are density-connected.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:310px; top:7430px; width:230px; height:60px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">As shown in Figure 7, the diamond-shaped individuals are core individuals with $m' = 3$, and the circles mark their $\varepsilon$-neighborhoods.
<br>Individuals joined by straight lines form a chain of density-reachable core individuals, and all individuals inside the $\varepsilon$-neighborhoods of that chain are density-connected.
<br>A more detailed description can be found in reference </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">[</span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">?</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">]</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:139px; top:7154px; width:92px; height:14px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">5. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">A brief illustration of the PCA principle
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:7185px; width:229px; height:76px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The effect of PCA can be assessed with the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">explained variance ratio</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, the ratio of a single variable's variance to the total variance, which indicates to some extent how much of the information in the whole sample that variable carries.
<br>As shown in Figure 6, PCA reduces the number of features per sample from 42754 to 1000</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">19</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, while the cumulative explained variance ratio already exceeds 90%,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:118px; top:7266px; width:139px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">so little information is lost and the dimensionality reduction is reasonable.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:107px; top:7430px; width:157px; height:14px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">6. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">Number of features versus cumulative explained variance ratio
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:76px; top:7465px; width:219px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">At this point, the unstructured text has been converted into a structured data set.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:79px; top:7480px; width:119px; height:28px;"><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">19</span><span style="font-family: ASLVIP+SimSun; font-size:9px">For the reason for choosing 1000 features, see
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:303px; top:7523px; width:5px; height:15px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">9
<br></span></div><div style="position:absolute; border: figure 1px solid; writing-mode:False; left:86px; top:7022px; width:198px; height:136px;"></div><div style="position:absolute; border: figure 1px solid; writing-mode:False; left:86px; top:7289px; width:198px; height:144px;"></div><span style="position:absolute; border: black 1px solid; left:70px; top:7496px; width:92px; height:0px;"></span>
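<div><span style="font-family: ASLVIP+SimSun; font-size:10px">The explained variance ratio used to judge PCA can be illustrated with a short sketch. It assumes that the variance carried by each principal component is already known; the function names and the toy variances are hypothetical and are not taken from the paper's data or code.</span></div>
<pre>
```python
from itertools import accumulate


def explained_variance_ratio(variances):
    """Share of the total variance carried by each component."""
    total = sum(variances)
    return [v / total for v in variances]


def components_needed(variances, threshold=0.90):
    """Smallest number of leading components whose cumulative ratio reaches `threshold`."""
    ordered = sorted(variances, reverse=True)
    cumulative = accumulate(explained_variance_ratio(ordered))
    for k, share in enumerate(cumulative, start=1):
        if share >= threshold:
            return k
    return len(ordered)


# Hypothetical component variances, for illustration only (not the paper's data).
toy_variances = [5.0, 3.0, 1.5, 0.5]
print(explained_variance_ratio(toy_variances))
print(components_needed(toy_variances))
```
</pre>
<div><span style="font-family: ASLVIP+SimSun; font-size:10px">Applied to real component variances, the same check would indicate whether keeping 1000 components still covers more than 90% of the total variance.</span></div>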
<span style="position:absolute; border: gray 1px solid; left:0px; top:7628px; width:612px; height:792px;"></span>
<div style="position:absolute; top:7628px;"><a name="10">Page 10</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:316px; top:7692px; width:219px; height:34px;"><span style="font-family: BSCXIL+CMR10; font-size:9px">$\{-1: 3924,\ 0: 3,\ 1: 2,\ 2: 2,\ 3: 3,\ 4: 2,\ 5: 2,\ 6: 2,\ 7: 2,\ 8: 3,\ 9: 2,\ 10: 2,\ 11: 2,\ 12: 3,\ 13: 4,\ 14: 2,\ 176: 2,\ 177: 2,\ 178: 2\}$
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:313px; top:7751px; width:225px; height:45px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">There are thus 179 clusters containing more than 1 message; all remaining messages are single or outlier messages.
<br>Of the 4326 messages in total, 3924 are outliers, so the other 402 messages fall into these clusters, which is a reasonable clustering result.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:7809px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">To extract hot issues, the heat of each message must first be assessed.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:313px; top:7816px; width:229px; height:20px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Let the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">attention</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> of message $i$ be $y_i$, $i \in \{1, 2, \cdots, 4326\}$.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:310px; top:7840px; width:230px; height:56px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Because posting a message costs far more effort than giving a like, each message is assigned an </span><span style="font-family: VDCSKW+SimHei; font-size:10px">initial attention</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> of 10. Each </span><span style="font-family: VDCSKW+SimHei; font-size:10px">like</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> a message receives adds 1 point of attention.
<br>Since the definition of heat includes the public's participation in an issue, the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">number of dislikes</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> also reflects,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:7902px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">to some extent, the heat of a message. A dislike is, however, ultimately a rejection of the message,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:7918px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">so this paper </span><span style="font-family: VDCSKW+SimHei; font-size:10px">deducts
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:395px; top:7930px; width:65px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">0.5 points of attention for each dislike.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:7956px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">On the other hand, the likes and dislikes of a message accumulate over time,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:7972px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">whereas the underlying issue may meanwhile have been resolved. A time factor must therefore be included:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:313px; top:7987px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">the attention of older messages should be reduced accordingly.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:8003px; width:229px; height:27px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Let $\Delta y_i$ denote the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">reduction in attention</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> of message $i$ caused by the time factor.
<br>Inspecting the timestamps in the corpus shows that the most recent message time
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:335px; top:8030px; width:181px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">is $T$ = </span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">2020-1-26 19:47:11</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, and the earliest
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:316px; top:8046px; width:219px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">is </span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">2017-6-8 17:31:20</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, so the longest time span is 962
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:418px; top:8065px; width:19px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">days.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:312px; top:8088px; width:227px; height:41px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The reduction $\Delta y_i$ should grow exponentially with the time span $\Delta t$, that is, the number of days between the message's time and $T$</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">21</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
<br>In other words, the decay is barely noticeable over the most recent few months and increases markedly after that.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:345px; top:8126px; width:166px; height:19px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The relation between $\Delta t$ and $-\Delta y$ can therefore be written as:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:377px; top:8154px; width:97px; height:19px;"><span style="font-family: TWCQZW+CMMI10; font-size:9px">$-\Delta y = a \exp(b\,\Delta t) - c$
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:528px; top:8159px; width:12px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(4)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:8199px; width:229px; height:41px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">To solve for the parameters, note that the largest attention of any message in the corpus is 2107.
<br>It is therefore assumed that attention drops by 2107 after 962 days
<br>and by 2107/16 after 962/2 days.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:329px; top:8246px; width:199px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">With the time span measured in days, this gives the system of equations:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:358px; top:8233px; width:81px; height:82px;"><span style="font-family: TWCQZW+CMMI10; font-size:9px">$\begin{cases} a \exp(0) - c = 0, \\ a \exp\left(\tfrac{962}{2}\, b\right) - c = 2107/16, \\ a \exp(962\, b) - c = 2107 \end{cases}$
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:528px; top:8288px; width:12px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(5)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:86px; top:7843px; width:197px; height:14px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">7. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">Core individuals, $\varepsilon$-neighborhoods, density reachability and density connectivity
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:70px; top:7877px; width:230px; height:45px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Therefore, if chains of core individuals such as those in Figure 7 can be found, the set of individuals covered by the $\varepsilon$-neighborhoods of each chain can be taken as one cluster, which yields a clustering of the samples.
<br>The figure also shows that both parameters, $\varepsilon$ and $m'$, affect the
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:163px; top:7928px; width:49px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">clustering result.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:7940px; width:229px; height:29px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">To perform the clustering, DBSCAN picks an arbitrary core individual as a seed
<br>and forms a cluster from all individuals that are density-reachable from it.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:74px; top:7975px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Clearly, all individuals in this cluster are density-connected,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:7991px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">and the cluster may contain more than one core individual (that is, the seed can generate a chain of
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:73px; top:8006px; width:225px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">core individuals). Next, a core individual that does not belong to any existing cluster
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:8022px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">is chosen as a new seed, and all individuals density-reachable from it form another
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:78px; top:8037px; width:219px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">cluster. This is repeated until no unassigned core individual remains.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:8054px; width:229px; height:56px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">After this loop, many non-core individuals may still be unassigned.
<br>DBSCAN generally marks them as noise points.
<br>In addition, DBSCAN generally assigns individuals on a first-come, first-served basis:
<br>an individual that is reachable from several core individuals
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:8116px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">belonging to different clusters stays in the cluster
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:148px; top:8132px; width:79px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">it was first assigned to.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:118px; top:8158px; width:134px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">3.2.2 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Text Clustering and Heat Ranking
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:8182px; width:229px; height:29px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">When documents are clustered with DBSCAN, a large number of noise points, or outlier messages, is to be expected.
<br>In other words, suitable parameters must be found
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:8217px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">so that, even with many noise points, as many clusters as possible
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:127px; top:8229px; width:121px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">containing more than 2 messages are produced.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:8245px; width:229px; height:60px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">This paper therefore uses the DBSCAN clustering algorithm with $\varepsilon = 3$ and $m' = 2$</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">20</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> to cluster the messages, that is, to group similar messages into one cluster.
<br>Hot issues can then be identified on the basis of these clusters.
<br>The clustering result is shown below, where $-1$ denotes clusters that contain only a single message.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:105px; top:8311px; width:166px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The display format is “</span><span style="font-family: AGCYGV+FangSong; font-size:10px">cluster: number of messages</span><span style="font-family: ASLVIP+SimSun; font-size:10px">”:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:79px; top:8326px; width:152px; height:23px;"><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">20</span><span style="font-family: ASLVIP+SimSun; font-size:9px">For how its parameters are chosen, see subsection </span><span style="font-family: JXSLWJ+LMRoman9-Bold; font-size:12px">??</span><span style="font-family: ASLVIP+SimSun; font-size:9px"> of the low-level implementation.
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">21</span><span style="font-family: ASLVIP+SimSun; font-size:9px">When computing a time difference in days, this paper works to one-second precision.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:300px; top:8365px; width:10px; height:15px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">10
<br></span></div><div style="position:absolute; border: figure 1px solid; writing-mode:False; left:86px; top:7698px; width:198px; height:149px;"></div><span style="position:absolute; border: black 1px solid; left:395px; top:8299px; width:11px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:70px; top:8327px; width:92px; height:0px;"></span>
<span style="position:absolute; border: gray 1px solid; left:0px; top:8470px; width:612px; height:792px;"></span>
<div style="position:absolute; top:8470px;"><a name="11">Page 11</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:124px; top:8539px; width:122px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Finally, the attention decay function is obtained as</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:8543px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">the mean "gap" to all of the other sentences; that is, a sentence in a central
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:97px; top:8563px; width:176px; height:19px;"><span style="font-family: BSCXIL+CMR10; font-size:9px">$-\Delta y = 262.125 \exp(0.0023\,\Delta t) - 262.125$
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:288px; top:8568px; width:12px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(6)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:74px; top:8596px; width:229px; height:29px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">For more details on the attention decay function, see Appendix </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">B</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
<br>Because the attention of a message cannot be negative, combining it with time decay gives the attention of message $i$ as:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:112px; top:8631px; width:121px; height:34px;"><span style="font-family: BSCXIL+CMR10; font-size:9px">$\begin{cases} y_i = 10 + Y - 0.5N + \Delta y_i, &amp; y_i &gt; 0 \\ 0, &amp; y_i \le 0 \end{cases}$
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:288px; top:8669px; width:12px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(7)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:88px; top:8698px; width:195px; height:11px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">where $Y$ and $N$ are the numbers of upvotes and downvotes, respectively, and $\Delta y_i$ is computed from Eq. </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(</span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">)</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> and the time span of message $i$.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:8730px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Summing the attention of all messages in a cluster then gives the heat of that cluster. Because the messages within a cluster are similar, a cluster can be regarded as a single issue. Ranking the clusters by heat in descending order then yields the top </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">5 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">hot issues.
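<br>As a minimal Python sketch of Eqs. (6) and (7) and of the ranking step, assume each message is given as a hypothetical tuple (cluster id, upvotes, downvotes, time span in days). The function names and the input layout are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import defaultdict

def attention_drop(dt_days):
    """Delta y from Eq. (6): -Delta y = 262.125 * exp(0.0023 * dt) - 262.125."""
    return -(262.125 * math.exp(0.0023 * dt_days) - 262.125)

def attention(upvotes, downvotes, dt_days):
    """Eq. (7): y = 10 + Y - 0.5 * N + Delta y, clipped at zero."""
    y = 10 + upvotes - 0.5 * downvotes + attention_drop(dt_days)
    return max(y, 0.0)

def top_clusters(messages, k=5):
    """Sum attention per cluster and return the k hottest (cluster, heat) pairs.

    `messages` is an assumed list of (cluster_id, upvotes, downvotes, dt_days).
    """
    heat = defaultdict(float)
    for cluster_id, upvotes, downvotes, dt_days in messages:
        heat[cluster_id] += attention(upvotes, downvotes, dt_days)
    return sorted(heat.items(), key=lambda item: item[1], reverse=True)[:k]
```

Calling `top_clusters(messages)` with the default `k=5` corresponds to picking the top 5 hot issues; whether single-message clusters should be filtered out beforehand is left to the caller in this sketch.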
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:117px; top:8800px; width:136px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">3.3 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Content Summarization and Key-Sentence Extraction
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:74px; top:8822px; width:229px; height:29px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Each of the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">5 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">hot issues extracted in subsection </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> contains only a single message. However, the messages are fairly long, and in practice an issue may comprise several messages. To summarize the issue description in that case, all messages in a cluster can be merged into one </span><span style="font-family: VDCSKW+SimHei; font-size:10px">composite document</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, from which key sentences are then extracted algorithmically. This subsection presents a key-sentence extraction algorithm, </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">TextRank</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, and demonstrates its use on an example issue.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:104px; top:8945px; width:162px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">3.3.1 PageRank </span><span style="font-family: VDCSKW+SimHei; font-size:11px">and </span><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">TextRank
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:8968px; width:229px; height:29px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">PageRank </span><span style="font-family: ASLVIP+SimSun; font-size:10px">is an algorithm developed by </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Google </span><span style="font-family: ASLVIP+SimSun; font-size:10px">for ranking web pages. It treats the Internet as a directed graph in which web pages are the nodes and a link between two pages is a directed edge. The score of each page depends on the scores of the pages that link to it and on how many outgoing links those pages have. If there is a directed edge between nodes $V_i$ and $V_j$, the score can be computed as:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:76px; top:9018px; width:220px; height:56px;"><span style="font-family: BSCXIL+CMR10; font-size:9px">$S(V_i) = (1 - d) + d \times \sum_{V_j \in \mathrm{In}(V_i)} \frac{1}{|\mathrm{Out}(V_j)|} S(V_j)$
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:288px; top:9060px; width:12px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(8)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:9098px; width:226px; height:30px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">where $S(V)$ is the score of node $V$, initialized to 1. $d \in (0, 1]$ is a constant factor that models, for a user on page $V_i$,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:372px; top:8558px; width:111px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">position has a larger value of $d$.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:312px; top:8574px; width:227px; height:44px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Because every sentence is linked to every other sentence, using Eq. </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(</span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">)</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> directly is not advisable: $|\mathrm{Out}(V_j)|$ is simply the number of links, which is the same for every sentence, so it cannot tell whether a sentence is "key". We therefore define the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">BM25 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">similarity between two sentences as follows:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:320px; top:8659px; width:70px; height:10px;"><span style="font-family: BSCXIL+CMR10; font-size:9px">$BM25(V_i, V_j) = \sum_{k=1}^{n} IDF(term_k) \, \frac{TF(term_k, V_i)\,(\alpha + 1)}{TF(term_k, V_i) + \alpha \left(1 - \beta + \beta \frac{|V_i|}{DL}\right)}$
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:528px; top:8686px; width:12px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(9)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:8727px; width:229px; height:135px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">where $\alpha$ and $\beta$ are constants to be chosen; they determine the contributions of term frequency and sentence length to </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">BM25</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, respectively. The larger $\alpha$ is, the greater the positive effect of term frequency on the weight; the larger $\beta$ is, the greater the negative effect of sentence length on the weight. Here the two are treated equally, i.e. $\alpha = \beta = 1$. $|V_i|$ is the number of words in the sentence, so the sentence must first be segmented with the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">conditional random field model</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> of subsection </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. $DL$ is the average number of words per sentence in the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">composite document</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. Taking the sentence as the unit, $TF$ and $IDF$ are the term frequency and inverse document frequency of a word; $TF(term_k, V_i)$ is the frequency of the word $term_k$ in sentence $V_i$, and $IDF$ is computed as follows:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:363px; top:8892px; width:85px; height:15px;"><span style="font-family: BSCXIL+CMR10; font-size:9px">$IDF(term_k) = \log\left(\frac{S}{DF + 1}\right)$
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:523px; top:8892px; width:17px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(10)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:314px; top:8926px; width:222px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">where $S$ is the total number of sentences in the composite document and $DF$ is the number of sentences that contain the word $term_k$.
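<br>A minimal Python sketch of Eqs. (9) and (10) follows. It assumes the sentences are already segmented into word lists, and it assumes that the terms $term_1, \dots, term_n$ in the sum are the words of $V_j$, which the text does not state explicitly. The function names are illustrative.

```python
import math
from collections import Counter

def idf(term, sentences):
    """Eq. (10): log(S / (DF + 1)); S sentences in total, DF of them contain `term`."""
    df = sum(1 for s in sentences if term in s)
    return math.log(len(sentences) / (df + 1))

def bm25(v_i, v_j, sentences, alpha=1.0, beta=1.0):
    """Eq. (9): score sentence v_i against the words of v_j (assumed query terms)."""
    tf = Counter(v_i)
    avg_len = sum(len(s) for s in sentences) / len(sentences)   # DL
    norm = alpha * (1 - beta + beta * len(v_i) / avg_len)
    score = 0.0
    for term in v_j:
        f = tf[term]
        if f == 0:
            continue  # a term absent from v_i contributes nothing
        score += idf(term, sentences) * f * (alpha + 1) / (f + norm)
    return score
```

Note that with Eq. (10) a word occurring in every sentence has $DF = S$ and therefore a slightly negative IDF, so very common words pull the similarity down rather than up.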
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:8956px; width:234px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Then, by analogy with Eq. </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(</span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">)</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, the "score" of each sentence can be computed as follows:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:315px; top:8999px; width:80px; height:42px;"><span style="font-family: BSCXIL+CMR10; font-size:9px">$S(V_i) = (1 - d) + d \times \sum_{V_j \in \mathrm{In}(V_i)} \frac{BM25(V_i, V_j)}{\sum_{V_k \in \mathrm{Out}(V_j)} BM25(V_i, V_k)} S(V_j)$
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:523px; top:9026px; width:17px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(11)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:9064px; width:229px; height:61px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Every sentence is given an initial score of </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">1</span><span style="font-family: ASLVIP+SimSun; font-size:10px">; after a number of iterations, the final score of each sentence is obtained.
<br>The highest-scoring sentences are then taken as the key sentences of the merged document. For details of the algorithm, see Chapter </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">9 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">of reference </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">[</span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">?</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">]</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
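<br></span></div><div><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">The iteration described above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: it assumes the usual damped TextRank update with d = 0.85, character-level tokens, and a simplified BM25 variant with a non-negative IDF. The helper names (split_sentences, bm25_weights, textrank, key_sentences) are hypothetical.
</span><pre>

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Cut text at commas, periods, semicolons and similar marks."""
    parts = re.split(r"[，。；！？,.;!?\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def bm25_weights(docs, k1=1.5, b=0.75):
    """w[i][j]: BM25 score of sentence j when sentence i is the query."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    doc_freq = Counter(t for d in docs for t in set(d))
    # Assumed IDF variant (always non-negative).
    idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in doc_freq.items()}
    counts = [Counter(d) for d in docs]
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            norm = k1 * (1 - b + b * len(docs[j]) / avg_len)
            w[i][j] = sum(
                idf[t] * counts[j][t] * (k1 + 1) / (counts[j][t] + norm)
                for t in set(docs[i]) if t in counts[j]
            )
    return w


def textrank(w, d=0.85, iterations=50):
    """Damped score propagation; every sentence starts at 1."""
    n = len(w)
    out_total = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each neighbour j passes on its score in proportion to the
        # edge weight, normalized by j's total outgoing weight.
        scores = [
            (1 - d) + d * sum(
                w[j][i] / out_total[j] * scores[j]
                for j in range(n) if j != i and out_total[j] > 0
            )
            for i in range(n)
        ]
    return scores


def key_sentences(text, top_k=3):
    """Return the top_k highest-scoring sentences of the merged text."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    # Character-level tokens (an assumption made for this sketch).
    docs = [[ch for ch in s if not ch.isspace()] for s in sentences]
    scores = textrank(bm25_weights(docs))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:top_k]]
```

</pre><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Calling key_sentences(merged_text, top_k=3) on the concatenated messages of one cluster returns three candidate key sentences. A fixed iteration count is used here for simplicity; a convergence threshold would work equally well.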
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:153px; top:9133px; width:69px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">的链接的概率。
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:9145px; width:338px; height:46px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Extending </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">PageRank </span><span style="font-family: ASLVIP+SimSun; font-size:10px">to key sentences, sentences are taken as nodes</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">22</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, and a link is assumed to exist between every pair of sentences. Let </span><span style="font-family: TWCQZW+CMMI10; font-size:9px">d </span><span style="font-family: ASLVIP+SimSun; font-size:10px">denote the probability that the current sentence
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">22</span><span style="font-family: ASLVIP+SimSun; font-size:9px">Sentences can be extracted from the text by scanning forward for commas, periods, semicolons and similar punctuation.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:9133px; width:229px; height:41px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The problem description of the corresponding cluster can then be written manually from this key sentence. The region or the group of people concerned can be extracted from the message in which the key sentence occurs.
<br></span></div><span style="position:absolute; border: black 1px solid; left:213px; top:9070px; width:40px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:371px; top:8696px; width:145px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:499px; top:8705px; width:12px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:450px; top:8902px; width:33px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:371px; top:9036px; width:116px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:70px; top:9180px; width:92px; height:0px;"></span>
<span style="position:absolute; border: gray 1px solid; left:0px; top:9312px; width:612px; height:792px;"></span>
<div style="position:absolute; top:9312px;"><a name="12">Page 12</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:134px; top:9380px; width:102px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">3.3.2 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Summarizing Hot Issues
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:358px; top:9380px; width:134px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">3.4.2 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Dimensionality Reduction by Truncated Singular Value Decomposition
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:70px; top:9410px; width:232px; height:152px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">After clustering, the heat index of each cluster is computed and the clusters are sorted in descending order, which gives the top </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">5 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">hot issues. If a hot issue contains only </span><span style="font-family: VDCSKW+SimHei; font-size:10px">one </span><span style="font-family: ASLVIP+SimSun; font-size:10px">message, it can be summarized manually </span><span style="font-family: VDCSKW+SimHei; font-size:10px">right away</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, or </span><span style="font-family: VDCSKW+SimHei; font-size:10px">summarized after several key sentences have been extracted </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(</span><span style="font-family: ASLVIP+SimSun; font-size:10px">the approach taken in this paper</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">)</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. If an issue contains several messages, the time span of those messages can serve as the time range of the issue. Once the key sentences have been extracted as described in Section </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, the issue description is </span><span style="font-family: VDCSKW+SimHei; font-size:10px">summarized manually from the key sentences</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, and the region and the group of people concerned are extracted from the messages in which the key sentences occur.
<br>Each of the top </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">5 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">hot issues produced by the heat-ranking algorithm above contains only a single message. To demonstrate the key-sentence extraction algorithm, the author therefore summarized the top </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">10 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">hot issues instead.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:70px; top:9565px; width:230px; height:60px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Take the issue ranked </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">7</span><span style="font-family: ASLVIP+SimSun; font-size:10px">th by heat as an example; it contains </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">9 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">messages. To obtain an overall description of the issue, the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">9 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">messages can be merged into one document and ranked with </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">TextRank</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, which yields the following </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">3 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">key sentences:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:70px; top:9631px; width:230px; height:44px;"><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">[</span><span style="font-family: AGCYGV+FangSong; font-size:10px">When will the table tennis and badminton center of the </span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">A7 </span><span style="font-family: AGCYGV+FangSong; font-size:10px">County Culture and Sports Center officially open to the public</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">Is the Beiheng Line still going to be built</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">, </span><span style="font-family: AGCYGV+FangSong; font-size:10px">Why has Zhiji Road in District </span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">A4 </span><span style="font-family: AGCYGV+FangSong; font-size:10px">remained a dead-end road</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">]
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:9682px; width:229px; height:57px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Taking the three key sentences together with the messages they come from, this hot issue can be identified as inquiries and feedback about urban planning in City </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">A</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. The top </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">10 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">hot issues can be summarized in the same way; their detailed descriptions are given in the attached </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Excel </span><span style="font-family: ASLVIP+SimSun; font-size:10px">spreadsheet.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:150px; top:9758px; width:70px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">3.4 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Low-Level Implementation
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:9787px; width:229px; height:41px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Some of the algorithms need suitable refinement before the methods described in this paper can be put to industrial use. This section also explains where the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">DBSCAN </span><span style="font-family: ASLVIP+SimSun; font-size:10px">parameters come from, and the basis and method for tuning them.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:9403px; width:229px; height:170px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Ordinary </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">PCA </span><span style="font-family: ASLVIP+SimSun; font-size:10px">requires the data to be normalized, and it requires the eigenvalues and eigenvectors of the data's </span><span style="font-family: VDCSKW+SimHei; font-size:10px">covariance matrix </span><span style="font-family: ASLVIP+SimSun; font-size:10px">to be computed. The author, however, stores the sparse data in a special format. Operations such as normalization and forming the covariance matrix would therefore inevitably destroy the sparsity of the matrix, or produce a new dense matrix of the same size. Applying </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">PCA </span><span style="font-family: ASLVIP+SimSun; font-size:10px">directly is thus impractical.
<br>For this reason, the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">PCA </span><span style="font-family: ASLVIP+SimSun; font-size:10px">reduction in this chapter is implemented with truncated singular value decomposition. Its principle is similar to that of </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">PCA</span><span style="font-family: ASLVIP+SimSun; font-size:10px">: the data are projected onto the directions of certain eigenvectors. The eigenvectors, however, are computed directly from the sparse data, so the sparsity is not destroyed.
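<br></span></div><div><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">To make the point about sparsity concrete, here is a minimal standard-library sketch of a truncated SVD that touches only the stored non-zero entries. It assumes each row is kept as a {column: value} dictionary and uses power iteration with orthogonalization against the components already found; the paper's actual storage format and solver are not specified, and the names matvec, rmatvec and truncated_svd are hypothetical.
</span><pre>

```python
import math
import random


def matvec(rows, v):
    """Compute X v using only the non-zero entries of X."""
    return [sum(val * v[c] for c, val in row.items()) for row in rows]


def rmatvec(rows, u, n_cols):
    """Compute X^T u using only the non-zero entries of X."""
    out = [0.0] * n_cols
    for row, ui in zip(rows, u):
        for c, val in row.items():
            out[c] += val * ui
    return out


def truncated_svd(rows, n_cols, k, iterations=200, seed=0):
    """Return (reduced_rows, components) for the top-k right singular vectors.

    No centering is applied and no covariance matrix is formed, so the
    input is never turned into a dense matrix.
    """
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
        for _ in range(iterations):
            # Remove the directions that were already extracted.
            for comp in components:
                dot = sum(a * b for a, b in zip(v, comp))
                v = [a - dot * b for a, b in zip(v, comp)]
            # One power-iteration step on X^T X.
            v = rmatvec(rows, matvec(rows, v), n_cols)
            norm = math.sqrt(sum(a * a for a in v))
            if norm == 0.0:
                v = [0.0] * n_cols
                break
            v = [a / norm for a in v]
        components.append(v)
    # Project every sparse row onto the k components.
    reduced = [
        [sum(val * comp[c] for c, val in row.items()) for comp in components]
        for row in rows
    ]
    return reduced, components
```

</pre><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">For example, truncated_svd(rows, n_cols, k=2) maps each sparse row to a dense vector of length 2. Plain power iteration is slow on large data; a production system would more likely rely on a Lanczos-type or randomized solver from a numerical library.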
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:357px; top:9587px; width:137px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">3.4.3 DBSCAN </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Parameter Tuning
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:9611px; width:229px; height:217px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">This paper clusters with </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">DBSCAN </span><span style="font-family: ASLVIP+SimSun; font-size:10px">using \(\varepsilon = 3,\ m' = 2\), a setting arrived at through screening and deliberation. The author found that for \(m' &gt; 2\) and any \(\varepsilon \in (0, 10)\), the number of clusters containing more than </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">1 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">sample drops sharply, from nearly </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">200 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">to fewer than </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">50</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, while every cluster still holds only a single-digit number of samples. It is therefore reasonable to conclude that \(m' &gt; 2\) makes the clustering too discriminating.
<br>For any \(m' \in (0, 2]\) with \(\varepsilon &gt; 4\), the number of clusters containing more than </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">1 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">sample changes little, but one cluster comes to contain more than </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">100 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">samples. It is therefore reasonable to conclude that \(\varepsilon &gt; 3.5\) lowers the discrimination too far.
<br>For any \(m' \in (0, 2]\) with \(\varepsilon \le 3\), clusters with more than </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">10 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">samples rarely appear. In summary, the best results are obtained for \(m' \in (0, 2]\) and \(\varepsilon \in (3, 4]\).
<br>After weighing the actual results of the heat ranking, this paper chooses \(\varepsilon = 4,\ m' = 2\) as the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">DBSCAN </span><span style="font-family: ASLVIP+SimSun; font-size:10px">parameters.
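<br></span></div><div><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">The tuning procedure above amounts to sweeping the two parameters and watching two numbers: how many clusters contain more than one sample, and how large the biggest cluster becomes. The sketch below illustrates such a sweep with a small textbook DBSCAN. It assumes Euclidean distance and counts a point as its own neighbour; the distance measure and implementation used in the paper are not stated, and the names dbscan and sweep are hypothetical.
</span><pre>

```python
import math
from collections import Counter


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or -1 for noise."""
    n = len(points)
    # Brute-force neighbourhoods; each point is its own neighbour.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: expand
    return labels


def sweep(points, eps_values, min_samples_values):
    """Map (eps, min_samples) to (clusters with > 1 sample, largest size)."""
    report = {}
    for m in min_samples_values:
        for eps in eps_values:
            sizes = Counter(l for l in dbscan(points, eps, m) if l != -1)
            multi = [s for s in sizes.values() if s > 1]
            report[(eps, m)] = (len(multi), max(multi, default=0))
    return report
```

</pre><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">A call such as sweep(vectors, [2, 3, 3.5, 4, 5], [1, 2, 3]) gives one row per parameter pair. A first number that collapses suggests the clustering is too strict, while a second number that explodes suggests distinct issues are being merged.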
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:107px; top:9850px; width:156px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">3.4.1 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Query Optimization for the Matching Algorithm
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:350px; top:9839px; width:150px; height:17px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:17px">4 </span><span style="font-family: VDCSKW+SimHei; font-size:12px">Analysis of Reply Relevance and Completeness
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:9880px; width:229px; height:136px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Processes such as stop-word filtering (see Section ??) rely on a predefined </span><span style="font-family: AGCYGV+FangSong; font-size:10px">dictionary</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, and every scanning step requires one dictionary lookup to determine whether a matching word exists. With a sequential storage structure, each lookup costs \(O(n)\), which is clearly very expensive.
<br>Storing the dictionary as a </span><span style="font-family: VDCSKW+SimHei; font-size:10px">hash table </span><span style="font-family: ASLVIP+SimSun; font-size:10px">brings the lookup cost down to \(O(1)\)</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">23</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, although this is still not recommended when the dictionary is large. Since the stop-word dictionary is only about </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">7KB </span><span style="font-family: ASLVIP+SimSun; font-size:10px">in size, the implementation in this paper matches against the dictionary with a hash table.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:9866px; width:229px; height:150px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">To evaluate the relevance of a reply, this paper considers </span><span style="font-family: VDCSKW+SimHei; font-size:10px">word vectors</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, that is, a deep-learning method, to compare the message text with its reply. The similarity of the two texts serves as the measure of the reply's relevance. Messages and replies both tend to be long, and operations on word vectors are time-consuming. Before the text similarity is computed, the key-sentence extraction method described in Section ?? is therefore used to extract several key sentences from the message and from the reply, and the similarity between these key sentences is taken as the relevance measure.
<br>Incidentally, unlike text analysis and text clustering, evaluating the similarity of two texts is a fairly complex problem, one that machine-learning methods relying on hand-crafted features
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:79px; top:10021px; width:240px; height:12px;"><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">23</span><span style="font-family: ASLVIP+SimSun; font-size:9px">Memory address arithmetic is part of the chip's instruction set, so it costs essentially no extra </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">CPU </span><span style="font-family: ASLVIP+SimSun; font-size:9px">cycles.
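<br></span></div><div><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">The difference between the two storage choices is easy to show in Python, where a list gives the sequential \(O(n)\) lookup and a frozenset gives the hashed lookup with \(O(1)\) average cost. The function names and the toy stop-word list below are illustrative assumptions, not the paper's code.
</span><pre>

```python
def filter_stopwords_sequential(tokens, stopwords):
    """Baseline: every membership test scans the whole list, O(n) each."""
    stop_list = list(stopwords)
    return [t for t in tokens if t not in stop_list]


def filter_stopwords_hashed(tokens, stopwords):
    """Hash-based: the set is built once, then each test is O(1) on average."""
    stop_set = frozenset(stopwords)
    return [t for t in tokens if t not in stop_set]


if __name__ == "__main__":
    tokens = ["我", "的", "问题", "了"]
    print(filter_stopwords_hashed(tokens, ["的", "了"]))
```

</pre><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Both functions return the same tokens; only the cost per lookup differs. The set costs some extra memory for its hash buckets, which is harmless for a dictionary of a few kilobytes.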
<br></span></div><span style="position:absolute; border: black 1px solid; left:70px; top:10022px; width:92px; height:0px;"></span>
<span style="position:absolute; border: gray 1px solid; left:0px; top:10154px; width:612px; height:792px;"></span>
<div style="position:absolute; top:10154px;"><a name="13">Page 13</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:10227px; width:229px; height:41px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">cannot handle. Evaluating the similarity of two texts therefore usually calls for a "black-box" model such as a neural network, which extracts the feature information of the text automatically.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:10274px; width:229px; height:26px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">To evaluate the completeness and intelligibility of a reply, this paper uses a bigram dictionary, scanning the reply two adjacent Chinese characters at a time
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:10305px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">and checking whether the pair currently being scanned exists in the bigram dictionary,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:310px; top:10223px; width:231px; height:91px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">is likewise processed by the method of Section </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. A real corpus, of course, contains far more words than this. In that case a window can be defined and slid over the corpus from beginning to end with a step of </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">1</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. The word at the centre of the window and the words before and after it are then encoded with the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">one-hot </span><span style="font-family: ASLVIP+SimSun; font-size:10px">method to produce multiple context vectors. Note that the length of the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">one-hot </span><span style="font-family: ASLVIP+SimSun; font-size:10px">vector of a word must equal the total number of distinct words in the corpus.
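<br></span></div><div><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">The sliding-window scheme just described can be sketched as follows. The window radius of 1, the alphabetical vocabulary order and the names one_hot and context_pairs are assumptions made for this illustration only.
</span><pre>

```python
def one_hot(index, size):
    """Vector of length `size` with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def context_pairs(tokens, window=1):
    """Slide a window over the corpus with step 1.

    Returns the vocabulary and, for every position, a pair
    (one-hot of the centre word, list of one-hot context vectors).
    """
    vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
    size = len(vocab)  # one-hot length = number of distinct words
    pairs = []
    for pos, centre in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        context = [
            one_hot(vocab[tokens[k]], size)
            for k in range(lo, hi) if k != pos
        ]
        pairs.append((one_hot(vocab[centre], size), context))
    return vocab, pairs
```

</pre><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">For the three-word corpus ["a", "b", "c"], the middle position yields the centre vector [0, 1, 0] with the two context vectors [1, 0, 0] and [0, 0, 1]. Positions at either end of the corpus simply have fewer context words.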
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:103px; top:10320px; width:169px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">so as to judge the local completeness and intelligibility of the reply.
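<br></span></div><div><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">As a rough illustration of the bigram scan, the sketch below walks over a reply two adjacent characters at a time and reports the share of pairs found in a bigram dictionary. Using this share as the score is an assumption made for illustration; the scoring rule actually used is not given at this point in the text, and the name bigram_coverage is hypothetical.
</span><pre>

```python
def bigram_coverage(text, bigram_dict):
    """Share of adjacent character pairs that occur in bigram_dict.

    bigram_dict should be a set (or dict) of two-character strings so
    that each lookup is a constant-time hash probe.
    """
    pairs = [text[i:i + 2] for i in range(len(text) - 1)]
    if not pairs:
        return 0.0  # fewer than two characters: nothing to match
    hits = sum(1 for p in pairs if p in bigram_dict)
    return hits / len(pairs)
```

</pre><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">A reply made of ordinary words should score close to 1, while a string of unrelated characters should score close to 0. In practice the dictionary would be built from a large reference corpus.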
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:134px; top:10343px; width:103px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">4.1 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Evaluating Reply Relevance
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:10369px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">As stated above, this paper uses word vectors to evaluate
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:10384px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">the relevance of a reply. A key-sentence extraction algorithm extracts the key sentences
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:10400px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">from the message and from the reply. Once their vectors have been computed,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:73px; top:10415px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">the cosine of the angle between the vectors measures their similarity.
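<br>As a minimal sketch of this similarity measure, the cosine of the angle between two vectors can be computed as below. The function name cosine_similarity, the example vectors, and the convention of returning 0 for an all-zero vector are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector is treated as unrelated
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors standing in for a message and a reply.
message_vec = [1.0, 2.0, 0.0]
reply_vec = [2.0, 4.0, 0.0]
similarity = cosine_similarity(message_vec, reply_vec)
```

Vectors that point in the same direction give a value close to 1, and orthogonal vectors give 0.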
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:105px; top:10440px; width:161px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">4.1.1 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Word Vectors and </span><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">one-hot </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Encoding
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:10463px; width:229px; height:29px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The linguist </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">J.R.Firth </span><span style="font-family: ASLVIP+SimSun; font-size:10px">held that words with similar meanings also occur in similar
<br>contexts. On this view, every word has a corresponding
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:10498px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">context </span><span style="font-family: VDCSKW+SimHei; font-size:10px">word vector</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, such that a statistical model can predict the word
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:10513px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">from its word vector. In other words, the word vector represents the
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:173px; top:10529px; width:29px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">word.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:10545px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">But how can the word vector of every word be found? Clearly,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:10560px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">manual analysis is infeasible, and machine learning built on
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:10576px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">hand-crafted features is also ill-suited to the task. For this problem, therefore,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:10591px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">deep learning, that is, neural networks, which extract features automatically, has a
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:158px; top:10607px; width:59px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">natural advantage.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:10623px; width:229px; height:56px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Training a word vector for every word requires a word-segmented
<br>corpus. As before, the open-source </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">MSR </span><span style="font-family: ASLVIP+SimSun; font-size:10px">corpus of </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">SHANG05 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">serves as the training set.
<br>The </span><span style="font-family: VDCSKW+SimHei; font-size:10px">words adjacent to a word, before and after its position,
<br></span><span style="font-family: ASLVIP+SimSun; font-size:10px">are taken as its training features and are called its </span><span style="font-family: VDCSKW+SimHei; font-size:10px">context vector</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. To turn words into
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:76px; top:10685px; width:220px; height:41px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">training features, the context of each word is encoded with an encoding commonly used for unordered strings,
<br></span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">one-hot </span><span style="font-family: ASLVIP+SimSun; font-size:10px">encoding.
<br>For example, suppose the training corpus is </span><span style="font-family: AGCYGV+FangSong; font-size:10px">“第八届 泰迪杯 建模 挑战 赛”
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:10730px; width:229px; height:13px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">(“8th, Teddy Cup, modeling, challenge, competition”). The context vector of the word </span><span style="font-family: AGCYGV+FangSong; font-size:10px">“建模”</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> is then </span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">[</span><span style="font-family: AGCYGV+FangSong; font-size:10px">第八届，泰迪杯，挑战，赛</span><span style="font-family: JYNVVD+LMMono10-Regular; font-size:13px">]</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:70px; top:10760px; width:246px; height:45px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">With </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">one-hot </span><span style="font-family: ASLVIP+SimSun; font-size:10px">encoding this becomes </span><span style="font-family: TWCQZW+CMMI10; font-size:9px">$\boldsymbol{x} = \left[[1,0,0,0,0]^{T},\ [0,1,0,0,0]^{T},\ [0,0,0,1,0]^{T},\ [0,0,0,0,1]^{T}\right]$</span><span style="font-family: ASLVIP+SimSun; font-size:10px">,
<br>which is paired with the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">one-hot </span><span style="font-family: ASLVIP+SimSun; font-size:10px">encoding of the word </span><span style="font-family: AGCYGV+FangSong; font-size:10px">“建模”</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, </span><span style="font-family: TWCQZW+CMMI10; font-size:9px">$\boldsymbol{y} = [0,0,1,0,0]^{T}$</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
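<br>The worked example can be reproduced with a short sketch. The helper names one_hot and training_pair and the radius parameter (the number of context words taken on each side) are assumptions made for illustration. The vocabulary is indexed in order of first occurrence, which matches the example.

```python
def one_hot(index, size):
    """Vector of length `size` with a single 1 at position `index`."""
    return [1 if k == index else 0 for k in range(size)]


def training_pair(tokens, center, radius):
    """One-hot context vectors x and one-hot target y for tokens[center]."""
    vocab = list(dict.fromkeys(tokens))  # distinct words, first-occurrence order
    index = {word: k for k, word in enumerate(vocab)}
    lo = max(0, center - radius)
    hi = min(len(tokens), center + radius + 1)
    # Context: every word inside the window except the centre word itself.
    context = [tokens[k] for k in range(lo, hi) if k != center]
    x = [one_hot(index[word], len(vocab)) for word in context]
    y = one_hot(index[tokens[center]], len(vocab))
    return x, y


corpus = ["第八届", "泰迪杯", "建模", "挑战", "赛"]
x, y = training_pair(corpus, center=2, radius=2)
```

Each one-hot vector has as many entries as there are distinct words, so on a real corpus these vectors are long and almost entirely zero.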
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:351px; top:10329px; width:149px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">4.1.2 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Word Vectors and the </span><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">CBOW </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Model
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:313px; top:10352px; width:228px; height:76px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">As the previous subsection showed, the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">one-hot </span><span style="font-family: ASLVIP+SimSun; font-size:10px">vector $\boldsymbol{y}_i$ of every word</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">24</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> in the corpus
<br>corresponds to one or more context vectors $\boldsymbol{x}_i$. Hence,
<br>with $\boldsymbol{x}_i$ as input and $\boldsymbol{y}_i$ as output, a deep
<br>learning model can be trained; a well-known example is the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">CBOW </span><span style="font-family: ASLVIP+SimSun; font-size:10px">model.
<br>The </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">CBOW </span><span style="font-family: ASLVIP+SimSun; font-size:10px">model is in fact a three-layer neural network, as
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:403px; top:10430px; width:50px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">shown in Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:372px; top:10697px; width:106px; height:14px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">8. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">Overview of the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">CBOW </span><span style="font-family: ASLVIP+SimSun; font-size:10px">model</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">1
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:310px; top:10724px; width:230px; height:20px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Here $\boldsymbol{x}_i,\ i \in \{1, 2, \cdots, C\}$, are the column vectors obtained by </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">one-hot </span><span style="font-family: ASLVIP+SimSun; font-size:10px">encoding the context words,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:310px; top:10745px; width:230px; height:91px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">and $C$ depends on the size of the window: if the window extends $m$ words on each side of the centre word, then
<br>$C = 2m$. The matrix $\boldsymbol{W} \in \mathbb{R}^{n \times |V|}$ is the input-to-hidden
<br>weight matrix of the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">CBOW </span><span style="font-family: ASLVIP+SimSun; font-size:10px">network, and $\boldsymbol{W}' \in \mathbb{R}^{|V| \times n}$ is the hidden-to-output
<br>weight matrix. Here $n$ is the number of hidden-layer nodes</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">25</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, which must be
<br>chosen by hand, and $|V|$ is the number of distinct words (word types) in the corpus. The $i$-th
<br>column of $\boldsymbol{W}$ is the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">input word vector</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> of the $i$-th word in the corpus vocabulary, denoted
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:74px; top:10826px; width:353px; height:50px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Clearly this is an extremely sparse matrix, so for storage and access
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">24</span><span style="font-family: ASLVIP+SimSun; font-size:9px">Except for the words at the very beginning of the corpus, that is, the first window-radius words.
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">1</span><span style="font-family: ASLVIP+SimSun; font-size:9px">Figure adapted from </span><span style="font-family: WZWWZS+LMMono9-Regular; font-size:11px">https://docs.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc/edit
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">25</span><span style="font-family: ASLVIP+SimSun; font-size:9px">The </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">CBOW </span><span style="font-family: ASLVIP+SimSun; font-size:9px">model used in this paper has </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">100 </span><span style="font-family: ASLVIP+SimSun; font-size:9px">hidden-layer nodes.
<br></span></div><div style="position:absolute; border: figure 1px solid; writing-mode:False; left:326px; top:10457px; width:198px; height:243px;"></div><span style="position:absolute; border: black 1px solid; left:70px; top:10842px; width:92px; height:0px;"></span>
<span style="position:absolute; border: gray 1px solid; left:0px; top:10996px; width:612px; height:792px;"></span>
<div style="position:absolute; top:10996px;"><a name="14">Page 14</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:11069px; width:234px; height:41px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">$\boldsymbol{w}_i$; similarly, the $j$-th row of $\boldsymbol{W}'$ is the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">output word vector</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> of the $j$-th word, denoted $\boldsymbol{w}'_j$. Because the input word vectors are closer to the input layer,
<br>applications usually </span><span style="font-family: VDCSKW+SimHei; font-size:10px">discard</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> the output word vectors and use the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">input word vectors
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:123px; top:11115px; width:129px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">as the word vectors of the words.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:89px; top:11132px; width:199px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">After passing through the hidden-layer nodes, the inputs become the vectors
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:74px; top:11139px; width:223px; height:20px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">$\boldsymbol{v}_1 = \boldsymbol{W}\boldsymbol{x}_1,\ \boldsymbol{v}_2 = \boldsymbol{W}\boldsymbol{x}_2,\ \cdots,\ \boldsymbol{v}_C = \boldsymbol{W}\boldsymbol{x}_C$. Their
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:90px; top:11163px; width:191px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">average $\bar{\boldsymbol{v}} = \frac{1}{C}\sum_{i=1}^{C} \boldsymbol{v}_i$ is fed to the output layer, which yields the net output
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:147px; top:11179px; width:30px; height:9px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">$\boldsymbol{u} = \boldsymbol{W}'\bar{\boldsymbol{v}} \in \mathbb{R}^{|V|}$.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:75px; top:11192px; width:225px; height:45px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Finally, the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">softmax </span><span style="font-family: ASLVIP+SimSun; font-size:10px">activation function gives the prediction of $\boldsymbol{y}_i$,
<br>$\bar{\boldsymbol{y}}_i = \mathrm{softmax}(\boldsymbol{u})$. As with any machine-learning model,
<br>training the word-vector matrix $\boldsymbol{W}$ requires a loss function, defined over the components $y_j$ and $\bar{y}_j$ of $\boldsymbol{y}$ and $\bar{\boldsymbol{y}}$ as follows:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:130px; top:11253px; width:54px; height:19px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">$$H(\bar{\boldsymbol{y}}, \boldsymbol{y}) = -\sum_{j=1}^{|V|} y_j \log(\bar{y}_j)$$
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:283px; top:11258px; width:17px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(12)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:11295px; width:227px; height:45px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Finding the word vectors is therefore equivalent to the optimization problem of minimizing Eq. </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(</span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">)</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
<br>As before, this problem can be solved with an iterative gradient-based optimizer
<br>such as </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">L-BFGS </span><span style="font-family: ASLVIP+SimSun; font-size:10px">or </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Adam</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. For a detailed explanation of the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">CBOW
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:109px; top:11342px; width:157px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">model, see </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">[</span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">?</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">]</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
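<br>The forward pass and the loss of Eq. (12) can be traced for a single training sample with the sketch below. It is a pure-Python illustration under stated assumptions, not the paper's implementation: the function name cbow_forward, the use of word indices in place of explicit one-hot vectors, and the tiny all-zero weight matrices are hypothetical, and no training step is shown.

```python
import math


def cbow_forward(W, W_out, context_ids, target_id):
    """Forward pass of CBOW for one sample; returns (y_hat, loss).

    W     : n x |V| input weights (list of n rows); column i is word i's input vector.
    W_out : |V| x n output weights (list of |V| rows); stands for W' in the text.
    """
    n = len(W)
    # v_c = W x_c simply selects column c of W, because x_c is one-hot.
    vs = [[W[r][c] for r in range(n)] for c in context_ids]
    # Average the C hidden vectors.
    v_bar = [sum(v[r] for v in vs) / len(vs) for r in range(n)]
    # Net output u = W' v_bar: one score per vocabulary word.
    u = [sum(row[r] * v_bar[r] for r in range(n)) for row in W_out]
    # Numerically stable softmax.
    u_max = max(u)
    exps = [math.exp(score - u_max) for score in u]
    total = sum(exps)
    y_hat = [e / total for e in exps]
    # With a one-hot target, the cross-entropy reduces to -log of the target's probability.
    loss = -math.log(y_hat[target_id])
    return y_hat, loss


# Hypothetical tiny model: |V| = 3 words, n = 2 hidden nodes, all weights zero.
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
W_out = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
y_hat, loss = cbow_forward(W, W_out, context_ids=[0, 2], target_id=1)
```

With all weights at zero the prediction is uniform over the three words, so the loss equals log 3. An optimizer would then adjust W and W' to lower the loss summed over all training samples.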
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:129px; top:11375px; width:113px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">4.1.3 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Computing Reply Relevance
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:11400px; width:226px; height:29px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Once the parameters of the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">CBOW </span><span style="font-family: ASLVIP+SimSun; font-size:10px">model have been fitted on the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">MSR </span><span style="font-family: ASLVIP+SimSun; font-size:10px">corpus,
<br>the word vector of any word can be read from the input weight matrix $\boldsymbol{W}$.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:11435px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">For a given sentence, the word vector of every word in the sentence is looked up.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:11450px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">These word vectors are then averaged, which gives a dense
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:77px; top:11466px; width:217px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">vector $\boldsymbol{s}$. For any two sentences, the cosine of the angle between
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:83px; top:11481px; width:205px; height:11px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">$\boldsymbol{s}_i$ and $\boldsymbol{s}_j$ can then serve as the similarity of the sentences, as in Eq.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:352px; top:11064px; width:147px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">4.2 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Completeness and Comprehensibility of Replies
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:11090px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">In principle, the completeness of a reply could be assessed by </span><span style="font-family: VDCSKW+SimHei; font-size:10px">syntactic parsing</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, which tells whether a sentence is well formed. Parsing, however, first requires </span><span style="font-family: VDCSKW+SimHei; font-size:10px">part-of-speech tagging</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> of the words. Like word segmentation, part-of-speech tagging is a sequence labeling problem, so a machine learning model must be trained to tag automatically. After tagging, a second model has to be trained to perform the parsing itself. Open-source corpora can cover these two steps, but judging from the parse whether a sentence is fluent, or computing how fluent it is, requires training yet another model.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:11216px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">In practice, part-of-speech tagging and syntactic parsing are both available in industry, and the author has implemented them before; the last step is the hardest one. First, computing the fluency of a sentence from its syntax is a regression problem. Second, and most importantly, suitable corpora are scarce and inconsistent: there is no consensus even on how to decide from syntax whether a sentence is complete and fluent, let alone on how to score it. A machine learning solution therefore does not currently seem feasible</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">26</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:11326px; width:231px; height:72px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">This paper therefore settles for bigrams combined with dictionary matching to judge the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">local integrity</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> of a sentence. As described in Section </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, the bigram method scans a sentence from left to right and treats every two adjacent Chinese characters as one word. Each such word is then looked up in a dictionary that collects all bigrams</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">27</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. This catches wrongly written characters and, at the same time, gauges the local coherence of the sentence.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:316px; top:11416px; width:219px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The indicator $C$, which characterizes properties such as the completeness of a sentence, can then be computed by Eq. (15):
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:392px; top:11455px; width:67px; height:20px;"><span style="font-family: TWCQZW+CMMI10; font-size:9px">$C = 1 - N_{no}/L$
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:523px; top:11461px; width:17px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(15)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:283px; top:11522px; width:17px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(13)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:319px; top:11493px; width:213px; height:11px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">where $N_{no}$ is the number of bigram “words” in the reply that cannot be matched in the dictionary, and $L$ is the total number of bigram “words” in the reply.
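The completeness indicator can be illustrated with a minimal Python sketch. The function names `bigrams` and `completeness`, the toy dictionary, and the choice to return 0 for a reply shorter than two characters are assumptions made for this example; the paper itself uses an open-source core dictionary.

```python
def bigrams(text):
    """Slide a two-character window over the text from left to right."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def completeness(reply, dictionary):
    """Eq. (15): C = 1 - N_no / L."""
    grams = bigrams(reply)
    total = len(grams)                      # L: all bigram "words" in the reply
    if total == 0:
        return 0.0                          # assumption: too short to contain a bigram
    unmatched = sum(1 for g in grams if g not in dictionary)   # N_no
    return 1.0 - unmatched / total


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"您好", "问题", "已经", "经处", "处理"}
print(completeness("问题已经处理", toy_dictionary))
```

In this example the reply yields five bigrams, of which only “题已” is missing from the toy dictionary, so $C = 1 - 1/5$.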
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:164px; top:11493px; width:51px; height:50px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(13):
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:154px; top:11522px; width:30px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">$\cos\theta = \dfrac{\mathbf{s}_i \cdot \mathbf{s}_j}{|\mathbf{s}_i|\,|\mathbf{s}_j|}$
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:70px; top:11553px; width:230px; height:46px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">For a message and its reply, the method of Section </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> is used to extract </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">5 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">key sentences from each. Their sentence vectors $\mathbf{s}_i$ and $\bar{\mathbf{s}}_j$ are computed, and the largest cosine similarity over all pairs is taken as the relevance of the reply, as given in Eq. (14):
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:120px; top:11629px; width:38px; height:19px;"><span style="font-family: TWCQZW+CMMI10; font-size:9px">$S = \max_{\mathbf{s},\,\bar{\mathbf{s}}} \cos\langle \mathbf{s}, \bar{\mathbf{s}} \rangle = \max_{\mathbf{s},\,\bar{\mathbf{s}}} \dfrac{\mathbf{s} \cdot \bar{\mathbf{s}}}{|\mathbf{s}|\,|\bar{\mathbf{s}}|}$
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:283px; top:11629px; width:17px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(14)
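The relevance computation of Eqs. (13) and (14) can be sketched in plain Python as follows. The helper names, the two-dimensional toy embeddings, and the decision to skip words without an embedding are assumptions for illustration; in the paper the vectors come from the trained CBOW model.

```python
import math


def sentence_vector(words, embeddings):
    """Average the vectors of the words that have an embedding."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None                          # assumption: no known word, no sentence vector
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine(a, b):
    """Eq. (13): cos(theta) = (a . b) / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def relevance(message_vectors, reply_vectors):
    """Eq. (14): largest cosine over all (message, reply) key-sentence pairs."""
    return max(cosine(s, t) for s in message_vectors for t in reply_vectors)


# Hypothetical 2-dimensional embeddings, for illustration only.
toy = {"road": [1.0, 0.0], "repair": [0.0, 1.0], "thanks": [-1.0, 0.0]}
message = [sentence_vector(["road", "repair"], toy)]
reply = [sentence_vector(["thanks"], toy), sentence_vector(["road"], toy)]
print(relevance(message, reply))
```

Here the reply sentence built from “road” points in a direction close to the message vector, so it determines the maximum, while the sentence built from “thanks” contributes a negative cosine and is ignored.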
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:11655px; width:227px; height:19px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Since $S \in [-1, 1]$, a value $S &lt; 0$ indicates that the reply has nothing to do with the message.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:379px; top:11549px; width:92px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">4.3 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Reply Evaluation Model
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:349px; top:11576px; width:159px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">In summary, the score of a reply is:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:400px; top:11605px; width:18px; height:9px;"><span style="font-family: TWCQZW+CMMI10; font-size:9px">$G = \dfrac{S + C}{2}$
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:523px; top:11600px; width:17px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(16)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:310px; top:11629px; width:230px; height:45px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">where $S$ characterizes the relevance of the reply and is computed by Eq. (14), and $C$ characterizes its completeness and comprehensibility and is computed by Eq. (15). Taking the sample questions as an example, the reply scores are listed in the attached </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Excel </span><span style="font-family: ASLVIP+SimSun; font-size:10px">spreadsheet.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:79px; top:11679px; width:321px; height:38px;"><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">26</span><span style="font-family: ASLVIP+SimSun; font-size:9px">Libraries for English </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">NLP</span><span style="font-family: ASLVIP+SimSun; font-size:9px">, however, do provide such implementations.
<br></span><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">27</span><span style="font-family: ASLVIP+SimSun; font-size:9px">The dictionary must be collected from a “fluent” corpus; the open-source core dictionary compiled by He Han (何晗) is used here.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:300px; top:11733px; width:10px; height:15px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">14
<br></span></div><span style="position:absolute; border: black 1px solid; left:189px; top:11531px; width:26px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:231px; top:11638px; width:18px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:423px; top:11610px; width:26px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:70px; top:11695px; width:92px; height:0px;"></span>
<span style="position:absolute; border: gray 1px solid; left:0px; top:11838px; width:612px; height:792px;"></span>
<div style="position:absolute; top:11838px;"><a name="15">Page 15</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:139px; top:11906px; width:92px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">4.4 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Low-Level Optimization Algorithms
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:412px; top:11907px; width:27px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:11927px; width:229px; height:29px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">To speed up training, the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">CBOW </span><span style="font-family: ASLVIP+SimSun; font-size:10px">model in this paper is trained mainly with parallel multithreading. In addition, computing the score indicator $C$ of a reply spends a great deal of time on dictionary lookups, which therefore also need to be optimized. This section presents these low-level optimizations.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:316px; top:11926px; width:219px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Thus, if the condition holds, a matching character exists. In this way the lookup cost drops from the original $O(\log n)$ to a constant $O(1)$. According to the results in reference </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">[</span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">?</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">]</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, a double-array trie is nearly two orders of magnitude faster than sequential storage.
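The constant-time transition of a double-array trie can be sketched as follows. This is a minimal illustration that assumes the common formulation in which a transition from state `s` on character code `c` succeeds when `check[base[s] + c] == s`; the condition actually used in the paper is not recoverable from the text and may differ. The naive builder, the end-of-word marker `END`, and all function names are hypothetical.

```python
END = "\0"   # hypothetical end-of-word marker appended to every key


def build_double_array(words):
    """Naive construction: the children of a state live at base[state] + code(char)."""
    alphabet = sorted({ch for w in words for ch in w} | {END})
    code = {ch: i + 1 for i, ch in enumerate(alphabet)}   # character codes start at 1
    tree = {}                                             # ordinary nested-dict trie first
    for w in words:
        node = tree
        for ch in w + END:
            node = node.setdefault(ch, {})
    base, check = [0], [0]                                # state 0 is the root; -1 marks a free slot
    queue = [(0, tree)]
    while queue:
        state, node = queue.pop(0)
        if not node:
            continue                                      # END leaf: no outgoing edges
        codes = [code[ch] for ch in node]
        b = 1
        # Find the smallest offset whose target slots are all free.
        while any(len(check) > b + c and check[b + c] != -1 for c in codes):
            b += 1
        top = b + max(codes)
        if top >= len(check):
            grow = top + 1 - len(check)
            base.extend([0] * grow)
            check.extend([-1] * grow)
        base[state] = b
        for ch, child in node.items():
            t = b + code[ch]
            check[t] = state                              # slot t now belongs to `state`
            queue.append((t, child))
    return base, check, code


def contains(word, base, check, code):
    """Follow one array lookup per character; no searching among siblings."""
    state = 0
    for ch in word + END:
        c = code.get(ch)
        if c is None:
            return False
        t = base[state] + c
        if t >= len(check) or check[t] != state:          # the matching condition
            return False
        state = t
    return True


base, check, code = build_double_array(["ab", "ac", "b"])
print(contains("ab", base, check, code), contains("a", base, check, code))
```

Each character costs one addition and one comparison, regardless of how many siblings a node has, which is where the constant lookup cost comes from. Production implementations use far more careful construction to keep the arrays compact.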
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:134px; top:12015px; width:102px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">4.4.1 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Double-Array Trie
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:358px; top:12011px; width:134px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">4.4.2 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Parallel Implementation of the Search Algorithm
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:12042px; width:229px; height:41px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">As noted above, dictionary lookups dominate the time needed to compute $C$ and must be optimized first. The hash-table approach described in Section </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">?? </span><span style="font-family: ASLVIP+SimSun; font-size:10px">is not feasible here: the bigram dictionary is so large that loading it into memory in full and allocating a hash table for it is impractical.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:313px; top:12040px; width:225px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">For search algorithms (including </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">L-BFGS</span><span style="font-family: ASLVIP+SimSun; font-size:10px">), each parameter update can be written as Eq. (17):
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:387px; top:12088px; width:76px; height:10px;"><span style="font-family: WBMJEE+CMMIB10; font-size:9px">$\boldsymbol{\omega}_{k+1} = \boldsymbol{\omega}_k + \alpha \mathbf{d}_k$
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:523px; top:12084px; width:17px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(17)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:12119px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">For this case, the paper therefore loads the dictionary into memory as a </span><span style="font-family: VDCSKW+SimHei; font-size:10px">double-array trie</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. A trie is a tree-shaped storage structure that does not store words in its nodes directly; instead, it stores them along the edges between nodes, as shown in Figure 9. Because the scan here is not the forward maximum matching used earlier, the root of the trie has a large number of child nodes. To speed up the search, first-character hashing is used: the root maps to its children through a hash table. A small amount of extra storage is thus traded for search speed, as shown in Figure 10:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:93px; top:12363px; width:197px; height:14px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">9. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">How a trie works </span><span style="font-family: VDCSKW+SimHei; font-size:10px">Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">10. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">A trie with first-character hashing
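A trie with first-character hashing can be sketched as below. The class and method names are hypothetical, and storing the children of inner nodes in sorted lists searched with `bisect` is an assumption chosen to contrast the hashed root with logarithmic lookups deeper in the tree; the paper does not specify how inner nodes are stored.

```python
import bisect


class Node:
    def __init__(self):
        self.keys = []          # sorted edge labels (one character each)
        self.kids = []          # child nodes, parallel to keys
        self.is_word = False    # True if the path from the root spells a dictionary word

    def child(self, ch):
        """Binary search among the outgoing edges."""
        i = bisect.bisect_left(self.keys, ch)
        if i != len(self.keys) and self.keys[i] == ch:
            return self.kids[i]
        return None

    def add_child(self, ch):
        i = bisect.bisect_left(self.keys, ch)
        if i != len(self.keys) and self.keys[i] == ch:
            return self.kids[i]
        node = Node()
        self.keys.insert(i, ch)
        self.kids.insert(i, node)
        return node


class FirstCharHashTrie:
    def __init__(self, words=()):
        self.root = {}          # first character -> subtree: a hash table at the root only
        for w in words:
            self.insert(w)

    def insert(self, word):
        if not word:
            return
        node = self.root.setdefault(word[0], Node())
        for ch in word[1:]:
            node = node.add_child(ch)
        node.is_word = True

    def contains(self, word):
        node = self.root.get(word[0]) if word else None
        for ch in word[1:]:
            if node is None:
                return False
            node = node.child(ch)
        return node is not None and node.is_word


trie = FirstCharHashTrie(["问题", "问候", "处理"])
print(trie.contains("问题"), trie.contains("题问"))
```

Only the root pays the memory cost of a hash table, which matters because the root has by far the most children when every bigram starts a new path.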
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:312px; top:12116px; width:228px; height:58px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">In a serial implementation it seems that $\boldsymbol{\omega}_{k+1}$ can only be computed once $\boldsymbol{\omega}_k$ is known. However, a stochastic search algorithm computes its gradient- or Hessian-based direction $\mathbf{d}_k$ on a </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">mini-batch</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, so its value is inexact to begin with. In practice, $\boldsymbol{\omega}_{k+1}$ can therefore be computed from $\boldsymbol{\omega}_{k-1}, \boldsymbol{\omega}_{k-2}, \cdots, \boldsymbol{\omega}_{k-n}$. This observation makes it possible to train the model parameters in parallel.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:327px; top:12209px; width:201px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The parallel scheme adopted here is the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Master-Worker </span><span style="font-family: ASLVIP+SimSun; font-size:10px">pattern.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:12224px; width:234px; height:45px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Master </span><span style="font-family: ASLVIP+SimSun; font-size:10px">receives and dispatches tasks: it collects the $\mathbf{d}$ computed by the Workers, uses it to update $\boldsymbol{\omega}$, and sends the result back to the Workers. Each Worker receives the latest $\boldsymbol{\omega}$ from the Master, computes $\mathbf{d}$, and sends it to the Master.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:312px; top:12290px; width:227px; height:46px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Because the Workers run asynchronously, some of the $\boldsymbol{d}_k$ that Workers deliver to the Master are delayed by the time $\boldsymbol{\omega}_k$ is updated. In other words, $\boldsymbol{d}_k$ may not have been computed from $\boldsymbol{\omega}_{k-1}$.
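<br>The asynchronous exchange described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the training code used in this paper: the toy loss $(\omega - 3)^2$, the names gradient and train_async, the shared-memory threads, and the bounded queue are all hypothetical stand-ins. The Master applies each $d$ as it arrives, even when the Worker computed it from an older $\omega$.

```python
import queue
import threading


def gradient(w):
    """d for the toy loss (w - 3)^2; a stand-in for the real model gradient."""
    return 2.0 * (w - 3.0)


def train_async(num_workers=4, total_updates=1000, step=0.02):
    """Master-Worker loop: Workers compute d, the Master updates w."""
    state = {"w": 0.0}                            # parameter owned by the Master
    lock = threading.Lock()
    results = queue.Queue(maxsize=num_workers)    # bounded, so d cannot pile up
    stop = threading.Event()

    def worker():
        while not stop.is_set():
            with lock:
                w_seen = state["w"]               # latest w published by the Master
            d = gradient(w_seen)                  # may be stale once it is applied
            try:
                results.put(d, timeout=0.01)
            except queue.Full:
                continue                          # drop d, recompute from a fresher w

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in workers:
        t.start()
    for _ in range(total_updates):                # the Master side
        d = results.get()                         # d_k, possibly from an older w
        with lock:
            state["w"] -= step * d
    stop.set()
    for t in workers:
        t.join()
    return state["w"]
```

Calling train_async() should return a value close to 3, the minimizer of the toy loss. The small step size is what keeps the update tolerant of delayed $d$; with a large step, stale directions could make the iteration oscillate.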
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:12343px; width:229px; height:41px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">With the approach above, the model can be trained in parallel across multiple threads. As a result, training a CBOW network with 100 nodes took less than 3 minutes. The same parallel approach can also be used to train ensemble models such as AdaBoost (see Section </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">).
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:12410px; width:229px; height:41px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">One drawback, however, is that the remaining nodes are searched by binary search, which costs $O(\log n)$. When the number of child nodes $n$ is large, lookups are still slow. To speed up individual lookups, a double-array trie is used here instead.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:398px; top:12425px; width:54px; height:17px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:17px">5 </span><span style="font-family: VDCSKW+SimHei; font-size:12px">Conclusion
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:313px; top:12457px; width:229px; height:10px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Machine learning is both a theoretical science and a practical art.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:12472px; width:229px; height:41px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">In outline, a double-array trie represents the nodes of a trie with two arrays, base and check; a transition between nodes is expressed through the elements and indices of these arrays. When node $a$ receives the character $char$, the lookup proceeds as follows:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:148px; top:12531px; width:75px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">b = base[a]+char
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:135px; top:12546px; width:101px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">if check[b] == base[a]:
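<br>The transition rule above can be exercised on a toy dictionary. The following is a sketch under stated assumptions rather than the implementation used in this paper: build_double_array and contains are hypothetical helper names, the construction is a naive first-fit search instead of an optimized builder, word endings are tracked in a separate set for simplicity, and moving to state b after a successful check is assumed, since the listing above stops at the check.

```python
from collections import deque


def build_double_array(words):
    """Naively build base/check arrays for a small word list (illustration only)."""
    # Step 1: an ordinary trie; children[node] maps a character code to a child node.
    children = [{}]
    word_end = set()
    for word in words:
        node = 0
        for ch in word:
            code = ord(ch)
            if code not in children[node]:
                children.append({})
                children[node][code] = len(children) - 1
            node = children[node][code]
        word_end.add(node)

    # Step 2: place every trie node into the two arrays, breadth first.
    base, check = [0], [-1]      # slot 0 is the root; -1 marks an unowned slot
    taken = {0}                  # slots that already hold a node
    used_bases = set()           # every parent gets its own base value
    terminal = set()             # slots that end a word (a simplification)
    pending = deque([(0, 0)])    # pairs of (trie node, slot in the arrays)
    while pending:
        node, slot = pending.popleft()
        if node in word_end:
            terminal.add(slot)
        codes = sorted(children[node])
        if not codes:
            continue
        offset = 1               # first-fit search for a collision-free base
        while offset in used_bases or any(offset + c in taken for c in codes):
            offset += 1
        used_bases.add(offset)
        base[slot] = offset
        grow = offset + codes[-1] + 1 - len(base)
        if grow > 0:
            base.extend([0] * grow)
            check.extend([-1] * grow)
        for c in codes:
            check[offset + c] = offset       # so that check[b] == base[a] holds
            taken.add(offset + c)
            pending.append((children[node][c], offset + c))
    return base, check, terminal


def contains(word, base, check, terminal):
    """Follow b = base[a] + char, accepting a step only if check[b] == base[a]."""
    a = 0
    for ch in word:
        b = base[a] + ord(ch)
        if b >= len(check) or check[b] != base[a]:
            return False
        a = b                                # assumed: a successful check moves to b
    return a in terminal
```

Each step costs two array reads and one comparison, independent of how many children a node has, which is the advantage over binary search among the children. Giving every parent a distinct base value is what makes the check unambiguous in this sketch.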
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:316px; top:12472px; width:219px; height:10px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Although Chinese language processing has come a long way, many problems remain open. For example, many open-source libraries are academic in nature: their low-level implementations pay little attention to efficiency, and concrete details are often glossed over. Moreover, some problems, such as judging the fluency and completeness of a sentence from its grammar, are still unexplored territory.
<br></span></div><div style="position:absolute; border: figure 1px solid; writing-mode:False; left:80px; top:12266px; width:105px; height:100px;"></div><div style="position:absolute; border: figure 1px solid; writing-mode:False; left:187px; top:12274px; width:105px; height:92px;"></div><span style="position:absolute; border: gray 1px solid; left:0px; top:12680px; width:612px; height:792px;"></span>
<div style="position:absolute; top:12680px;"><a name="16">Page 16</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:150px; top:12748px; width:70px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">5.1 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Author's Remarks
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:12773px; width:229px; height:10px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Much previous work on text classification pays little attention to model selection. Many of the papers the author has read simply borrow results from foreign-language NLP and classify with a Bayesian classifier. The Bayesian classifier does perform well, as this paper also shows, but models such as logistic regression and support vector machines should not be ignored either. The contribution of this paper is therefore to bring together almost all common machine-learning classifiers (excluding structured prediction models such as conditional random fields) and to identify the models suited to Chinese text classification: the Bayesian classifier, SVC, and logistic regression. The author hopes this work makes a modest contribution to Chinese text classification.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:75px; top:12944px; width:225px; height:10px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">It must be said that many researchers treat neural networks as a cure-all, which is ill-advised, and especially so in text classification, as this paper has shown. In addition, many researchers who move into NLP directly from other fields skip rigorous, principled model selection, such as the K-fold cross-validation and grid search used in this paper, before adopting a model. Some even judge a model by testing it on a handful of samples without splitting off a training set. These practices are even more common in the CV field, and they indirectly reflect the immaturity of current Chinese NLP. The author hopes that the more principled model-selection procedure in this paper helps readers take a fresh look at the overall machine-learning workflow.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:128px; top:13103px; width:114px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">5.2 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Shortcomings and Limitations
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:86px; top:13128px; width:199px; height:10px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">The biggest flaw of this paper is its use of word segmentation, the bag-of-words model, and DBSCAN for text clustering. While going through the clustering results, the author found that many messages discussing the same topic had not been grouped into one cluster. The mistake was to overlook that any unsupervised clustering algorithm built on a bag-of-words model clusters by whether the surface words are exactly the same. To group messages properly, those that convey similar meanings should be clustered together, regardless of whether their surface words match.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:70px; top:13248px; width:230px; height:29px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">To address this, the word vectors described in Section </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px"> could be used as clustering features, because a word vector encodes the context of a word, or where the word "belongs". In a sense, a word vector captures the underlying meaning of a word. Hence, measuring the similarity of messages with word vectors and key sentences, as described in Sections </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px"> and </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">, and clustering by that similarity would work better. Alternatively, word vectors could serve as word features that are then clustered with DBSCAN, although this is computationally heavier.
<br>The approach was left unchanged because the text clustering method adopted here has a low computational cost and only needs to store the two DBSCAN parameters. The alternative, by contrast, requires training a CBOW model, storing a word-vector matrix (about 50MB for the MSR corpus), and having a corpus available. Moreover, researchers in embedded AI development rarely use deep learning, which leads to a kind of habitual thinking. The author offers this as a cautionary example for peers, in the hope of smoothing the way for later researchers.
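<br>The contrast drawn above between matching surface words and matching meaning can be illustrated with a toy example. This is a sketch under stated assumptions, not the method of this paper: the two-dimensional embeddings are made-up values rather than trained CBOW vectors, averaging the word vectors of a message is one simple choice among many, and the names cosine, bow_vector, and mean_vector are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bow_vector(tokens, vocab):
    """Bag-of-words counts over a fixed vocabulary."""
    return [tokens.count(word) for word in vocab]


def mean_vector(tokens, embeddings):
    """Average of the word vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical embeddings: near-synonyms are given nearby vectors.
embeddings = {
    "道路": [0.90, 0.10], "马路": [0.85, 0.20],
    "积水": [0.10, 0.90], "淹水": [0.20, 0.85],
}
msg_a = ["道路", "积水"]     # two segmented messages on the same topic
msg_b = ["马路", "淹水"]     # that share no surface word

vocab = sorted(set(msg_a) | set(msg_b))
bow_similarity = cosine(bow_vector(msg_a, vocab), bow_vector(msg_b, vocab))
vec_similarity = cosine(mean_vector(msg_a, embeddings), mean_vector(msg_b, embeddings))
```

Here bow_similarity is zero because the messages have no word in common, while vec_similarity is close to one. A density-based method such as DBSCAN run on the bag-of-words distances would therefore keep the two messages apart, whereas distances based on word vectors would place them together.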
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:12846px; width:229px; height:41px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">The second shortcoming is that model selection did not fully consider ensemble models. AdaBoost is one of the models that can rival deep learning, yet the author only considered the case where its base learner is logistic regression. In addition, because the training algorithm was poorly configured, AdaBoost failed to converge effectively during training, so training had to be stopped by hand. Many less mainstream but useful ensemble methods, such as Stack ensembles, were also left out. Stack ensembles, which combine heterogeneous models, may in fact be better suited to Chinese text classification.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:12970px; width:229px; height:10px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Third, the popularity calculation and the message-evaluation algorithm were designed entirely on the author's subjective judgment. There is still no agreed method for designing such algorithms; machine-learning researchers and linguists disagree, and the author lacks expertise in this area. The attention decay function was likewise designed without a scientific basis, which is regrettable.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:374px; top:13068px; width:103px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">5.3 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Future Work and Outlook
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:13089px; width:230px; height:29px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Section </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px"> briefly noted that linear classifiers perform well because the number of features is so high that the samples become linearly separable. In the author's experience, this may indicate that there are too few samples. The data set has a little over 9000 samples, of which nearly 7000 are used for training, while there are more than 3000 features. The sample size is clearly insufficient.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:13186px; width:229px; height:10px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">As noted earlier, this paper does not fully exploit the potential of ensemble learning. Future work will aim to explore how Boost and Stack ensembles perform on Chinese text classification.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:13244px; width:229px; height:29px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">On the other hand, Section </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px"> noted that sentence fluency is hard to assess with machine learning, because annotated corpora are scarce or practically nonexistent. The author hopes that the relevant institutions will join forces to annotate a corpus for grammatical fluency. For example, building on existing dependency grammar, the dependency structure of each sentence could be annotated and then used to judge its fluency. The author hopes these modest suggestions prompt better ideas from others.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:13357px; width:229px; height:26px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Finally, machine learning is both a theoretical science and a practical art, and so is statistical NLP. May this serve as mutual encouragement in the work ahead.
<br></span></div><span style="position:absolute; border: gray 1px solid; left:0px; top:13522px; width:612px; height:792px;"></span>
<div style="position:absolute; top:13522px;"><a name="17">Page 17</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:155px; top:13589px; width:60px; height:17px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:17px">A </span><span style="font-family: VDCSKW+SimHei; font-size:12px">Appendix </span><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:17px">A
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:148px; top:13618px; width:75px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">A.1 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">t-Test Table
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:135px; top:13650px; width:84px; height:120px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Table </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">5. </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">t-test table (p-values)</span>
<table>
<tr><th>p-value</th><th>Adaboost</th><th>Logistic regression</th><th>Bayesian classifier</th><th>SVC</th></tr>
<tr><th>Adaboost</th><td>1.00</td><td>0.60</td><td>0.23</td><td>0.59</td></tr>
<tr><th>Logistic regression</th><td>0.60</td><td>1.00</td><td>0.43</td><td>0.28</td></tr>
<tr><th>Bayesian classifier</th><td>0.23</td><td>0.23</td><td>1.00</td><td>0.10</td></tr>
<tr><th>SVC</th><td>0.59</td><td>0.43</td><td>0.28</td><td>1.00</td></tr>
</table><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:105px; top:13799px; width:160px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">A.2 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Solving the Classification Problem with a BP Neural Network
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:70px; top:13824px; width:230px; height:44px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">After the data are processed as described in Section </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">, the dependent variable (the class label) of each sample is an ordered integer in $0 \sim 6$, so the classification problem can be treated as a regression problem. A neural network for regression can then be trained, and its output rounded to the nearest integer to obtain the class of the sample.
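<br>The rounding step can be sketched as follows. This is a minimal illustration: the function name to_class is hypothetical, and clipping the result to the label range 0 to 6 is an added assumption, since the text above only mentions rounding.

```python
import math


def to_class(y, low=0, high=6):
    """Round a regression output half up, then clip it to the label range."""
    rounded = math.floor(y + 0.5)    # half up; Python's round() rounds half to even
    return int(min(high, max(low, rounded)))
```

For example, to_class(2.5) gives 3 and to_class(7.3) gives 6. Without the clipping, a regression output far outside the label range would map to a class that does not exist.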
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:70px; top:13903px; width:230px; height:76px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">The author tried using a single output node with a softmax activation, but the accuracy was only 20%, which is unsurprising because a softmax over a single node always outputs 1. The author also tried one-hot encoding, converting the output into 7 binary variables and thereby turning the task into a multi-label problem, with one output node per variable. However, this makes the model hard to evaluate with metrics such as F1 and adds many extra nodes to train.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:98px; top:13998px; width:180px; height:14px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">The structure of the neural network used in this paper is shown in Table </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:142px; top:14024px; width:87px; height:14px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Table </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">6. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">Neural network structure
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:114px; top:14042px; width:142px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Layer Nodes Activation
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:114px; top:14056px; width:142px; height:86px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Hidden layer </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">1 </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">500 ReLU
<br></span><span style="font-family: ASLVIP+SimSun; font-size:10px">Hidden layer </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">2 </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">250 ReLU
<br></span><span style="font-family: ASLVIP+SimSun; font-size:10px">Hidden layer </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">3 </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">100 ReLU
<br></span><span style="font-family: ASLVIP+SimSun; font-size:10px">Hidden layer </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">4 </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">50 ReLU
<br></span><span style="font-family: ASLVIP+SimSun; font-size:10px">Output layer </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">1 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">Linear
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:84px; top:14168px; width:202px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The model is trained for </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">100</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> steps using the
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:14184px; width:231px; height:60px;"><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Adam</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> optimizer with a step size of </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">0.01</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> and a </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">mini-batch </span><span style="font-family: ASLVIP+SimSun; font-size:10px">size of </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">100</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. The data set is split </span><span style="font-family: BSCXIL+CMR10; font-size:9px">7 : 3 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">into a
<br>training set and a test set. With the </span><span style="font-family: VDCSKW+SimHei; font-size:10px">mean squared error </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">(MSE)</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> as the cost
<br>function, the model parameters are fitted on the training set. The
<br>training and test </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">MSE </span><span style="font-family: ASLVIP+SimSun; font-size:10px">over the training iterations is shown in Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">11</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
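As a rough sketch of the regression-style setup described above (one linear output node whose value is rounded to a class index, with MSE as the cost), the snippet below uses plain Python. The function names, the assumption that labels are the integers 1 to 7, the clipping to that range, the random seed and the toy values are all illustrative and not taken from the author's implementation.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between targets and raw network outputs."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def to_class(output, low=1, high=7):
    """Round a real-valued output to the nearest class index.

    Clipping to [low, high] is an assumption; the labels are taken
    to be the integers 1..7 here.
    """
    return min(high, max(low, int(round(output))))

def split_7_3(samples, seed=0):
    """Shuffle and split samples 7 : 3 into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]

train, test = split_7_3(range(100))
raw_outputs = [0.6, 3.4, 7.9]   # toy outputs of the linear node
labels = [to_class(o) for o in raw_outputs]
loss = mse([1, 3, 7], raw_outputs)
```

Rounding keeps the network at a single output node, at the price of imposing an artificial ordering on the class indices.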
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:324px; top:13707px; width:203px; height:14px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">11. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">Training of the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">BP </span><span style="font-family: ASLVIP+SimSun; font-size:10px">neural network without regularization
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:13745px; width:231px; height:88px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The figure shows that the model tends to overfit. To mitigate
<br>overfitting, one option during training is </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Dropout
<br></span><span style="font-family: ASLVIP+SimSun; font-size:10px"> regularization. The </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Dropout </span><span style="font-family: ASLVIP+SimSun; font-size:10px">variant used
<br></span><span style="font-family: ASLVIP+SimSun; font-size:10px">here is coarse-grained and acts on whole nodes: at each iteration,
<br>a random subset of nodes (here </span><span style="font-family: BSCXIL+CMR10; font-size:9px">10</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">%</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> of the nodes in each layer) is
<br>left out of training. For more on regularization, see reference </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">[</span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">?</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">]
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:365px; top:13835px; width:126px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">, page </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">65</span><span style="font-family: ASLVIP+SimSun; font-size:10px">; the details are not repeated here.
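A minimal sketch of node-level Dropout as described above, assuming that exactly 10% of a layer's nodes are silenced in each iteration. The function names and the fixed seed are illustrative. The sketch also omits the rescaling of the surviving activations that common library implementations apply.

```python
import random

def dropout_mask(num_nodes, rate=0.10, rng=random):
    """Return a 0/1 mask that switches off round(rate * num_nodes) nodes."""
    num_dropped = round(rate * num_nodes)
    dropped = set(rng.sample(range(num_nodes), num_dropped))
    return [0 if i in dropped else 1 for i in range(num_nodes)]

def apply_mask(activations, mask):
    """Zero the activations of the nodes that sit out this iteration."""
    return [a * m for a, m in zip(activations, mask)]

rng = random.Random(42)
# One fresh mask per hidden layer and per training iteration.
masks = [dropout_mask(n, rng=rng) for n in (500, 250, 100, 50)]
hidden = apply_mask([0.5] * 50, masks[3])
```

Because a new mask is drawn at every iteration, the loss is computed on a slightly different sub-network each time, which is consistent with a noisier training curve.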
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:314px; top:13851px; width:227px; height:29px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">As Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">12</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> shows, with </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">Dropout </span><span style="font-family: ASLVIP+SimSun; font-size:10px">regularization the training
<br>curve becomes jagged ("spiky"), but the overfitting trend is reduced.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:334px; top:14010px; width:183px; height:14px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">12. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">Training of the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">BP </span><span style="font-family: ASLVIP+SimSun; font-size:10px">neural network with regularization
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:310px; top:14044px; width:230px; height:45px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">On the training and test sets the model's </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">F1 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">scores are </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">0.87
<br></span><span style="font-family: ASLVIP+SimSun; font-size:10px">and </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">0.66</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, which falls short of the machine learning models in Table </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. The
<br>model also needs a lot of storage (</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">175MB</span><span style="font-family: ASLVIP+SimSun; font-size:10px">), so for this </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">NLP
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:328px; top:14094px; width:199px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">task, deep learning looks like a great deal of effort for little return.
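To see where the storage goes, the parameters of the fully connected network in Table 6 can be counted layer by layer. The input dimension is not stated in this section, so `input_dim` below is a purely hypothetical value. The 4 bytes per parameter (32-bit floats, no optimizer state) are likewise an assumption, and the resulting size is not meant to reproduce the 175MB figure.

```python
def count_parameters(input_dim, layer_sizes):
    """Weights plus biases of a fully connected network."""
    total, fan_in = 0, input_dim
    for size in layer_sizes:
        total += fan_in * size + size
        fan_in = size
    return total

LAYERS = [500, 250, 100, 50, 1]   # hidden layers 1-4 and the output node
input_dim = 10_000                # hypothetical feature count
params = count_parameters(input_dim, LAYERS)
first_layer = input_dim * LAYERS[0] + LAYERS[0]
size_mb = params * 4 / 1e6        # assuming 4 bytes per parameter
share = first_layer / params      # fraction held by the first layer
```

With these assumed values the first layer holds about 97% of the parameters, so the width of the input vector, rather than the depth of the network, would drive the storage cost.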
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:396px; top:14118px; width:59px; height:17px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:17px">B </span><span style="font-family: VDCSKW+SimHei; font-size:12px">Appendix </span><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:17px">B
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:316px; top:14145px; width:218px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">B.1 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Comparison of segmenters on the </span><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">MSR </span><span style="font-family: VDCSKW+SimHei; font-size:11px">corpus
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:314px; top:14172px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Word segmentation is in effect a sequence labeling problem, i.e. a classification problem,
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:14187px; width:229px; height:41px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">but the label of a character depends on the character before it, so logistic regression,
<br></span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">SVC </span><span style="font-family: ASLVIP+SimSun; font-size:10px">and other unstructured models are bound to perform poorly. Structured
<br>classification models are therefore compared here, and the best one
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:403px; top:14234px; width:49px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">is put into use.
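Casting segmentation as sequence labeling can be sketched as follows. The four-tag BMES scheme (Begin, Middle, End, Single) is a common choice and is assumed here; this section does not say which tag set the author uses. The function names and the example sentence are illustrative.

```python
def words_to_tags(words):
    """Label every character of a segmented sentence with B, M, E or S."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Rebuild the segmentation from characters and their tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:              # tolerate a sequence that ends on B or M
        words.append(current)
    return words

gold = ["智能", "政务", "的", "文本挖掘"]
tags = words_to_tags(gold)
restored = tags_to_words("".join(gold), tags)
```

A tag such as E is only valid after B or M, which is the dependence on the previous character that structured models capture and per-character classifiers ignore.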
<br></span></div><span style="position:absolute; border: black 1px solid; left:72px; top:13667px; width:227px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:72px; top:13694px; width:227px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:72px; top:13712px; width:227px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:72px; top:13730px; width:227px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:72px; top:13757px; width:227px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:72px; top:13775px; width:227px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:108px; top:14040px; width:154px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:108px; top:14058px; width:154px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:108px; top:14076px; width:154px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:108px; top:14094px; width:154px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:108px; top:14112px; width:154px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:108px; top:14130px; width:154px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:108px; top:14148px; width:154px; height:0px;"></span>
<div style="position:absolute; border: figure 1px solid; writing-mode:False; left:341px; top:13592px; width:170px; height:118px;"></div><div style="position:absolute; border: figure 1px solid; writing-mode:False; left:341px; top:13895px; width:170px; height:118px;"></div><span style="position:absolute; border: gray 1px solid; left:0px; top:14364px; width:612px; height:792px;"></span>
<div style="position:absolute; top:14364px;"><a name="18">Page 18</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:70px; top:14433px; width:230px; height:75px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The models are trained on the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">MSR </span><span style="font-family: ASLVIP+SimSun; font-size:10px">corpus training set and evaluated on its test
<br>set. Besides the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">F1 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">score, the evaluation also
<br>reports the recall on out-of-vocabulary (OOV) words, i.e. how well the segmenter splits words it has never
<br>seen. The results are in Table </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">7</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. Although its training cost is the
<br>highest and its training time the longest</span><span style="font-family: WTHSSI+LMRoman7-Regular; font-size:9px">28</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, this paper still uses the best-performing conditional
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:158px; top:14514px; width:59px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">random field model.
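The two metrics mentioned above can be computed from word spans. The sketch below is an assumption about the evaluation procedure and follows the usual span-based definition: a word counts as correct only when both of its boundaries match the gold segmentation, and out-of-vocabulary recall is the share of gold words missing from the training vocabulary that are recovered. All names and the toy data are illustrative.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def evaluate(gold_words, pred_words, vocabulary):
    """Return (F1, OOV recall) for one sentence, both between 0 and 1.

    OOV recall is reported as 1.0 when the sentence has no OOV word.
    """
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = gold & pred
    precision = len(correct) / len(pred)
    recall = len(correct) / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    text = "".join(gold_words)
    oov = {s for s in gold if text[s[0]:s[1]] not in vocabulary}
    oov_recall = len(oov & pred) / len(oov) if oov else 1.0
    return f1, oov_recall

vocabulary = {"文本", "挖掘", "的", "应用"}       # toy training vocabulary
gold = ["政务", "文本", "挖掘", "的", "应用"]
pred = ["政", "务", "文本", "挖掘", "的", "应用"]
f1, oov_recall = evaluate(gold, pred, vocabulary)
```

In this toy case the only unseen word is split in two, so the F1 score stays fairly high while the out-of-vocabulary recall drops to zero, which is why the two numbers are reported separately.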
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:88px; top:14555px; width:195px; height:14px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Table </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">7. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">Performance of the structured segmenters on the </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">MSR </span><span style="font-family: ASLVIP+SimSun; font-size:10px">corpus
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:100px; top:14578px; width:170px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Model </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">F1 (%) </span><span style="font-family: ASLVIP+SimSun; font-size:10px">Out-of-vocabulary word recall (%)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:100px; top:14598px; width:170px; height:86px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">First-order hidden Markov model </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">79.44 41.02
<br></span><span style="font-family: ASLVIP+SimSun; font-size:10px">Second-order hidden Markov model </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">79.2 42.11
<br></span><span style="font-family: ASLVIP+SimSun; font-size:10px">Structured perceptron </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">94.62 70.54
<br></span><span style="font-family: ASLVIP+SimSun; font-size:10px">Conditional random field </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">96.78 71.52
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:121px; top:14740px; width:127px; height:15px;"><span style="font-family: QZGTZZ+LMRoman12-Bold; font-size:15px">B.2 </span><span style="font-family: VDCSKW+SimHei; font-size:11px">Details of the attention decay function
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:82px; top:14781px; width:207px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Plotting the attention decay function from its expression </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">(</span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">??</span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">)</span><span style="font-family: ASLVIP+SimSun; font-size:10px"> gives the curve
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:163px; top:14797px; width:50px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">shown in Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">13</span><span style="font-family: ASLVIP+SimSun; font-size:10px">:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:364px; top:14589px; width:123px; height:14px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Figure </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">13. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">Plot of the attention decay function
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:311px; top:14624px; width:229px; height:10px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">The plot shows that the longer the time span, the faster a message's attention
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:312px; top:14639px; width:227px; height:26px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">falls. Interestingly, at a time span of </span><span style="font-family: BSCXIL+CMR10; font-size:9px">∆</span><span style="font-family: TWCQZW+CMMI10; font-size:9px">t </span><span style="font-family: BSCXIL+CMR10; font-size:9px">= 30</span><span style="font-family: ASLVIP+SimSun; font-size:10px">, i.e. one
<br></span><span style="font-family: ASLVIP+SimSun; font-size:10px">month, the attention has decayed by less than </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:14px">1.7</span><span style="font-family: ASLVIP+SimSun; font-size:10px">. Further values are listed in
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:403px; top:14667px; width:50px; height:14px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Table </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">8</span><span style="font-family: ASLVIP+SimSun; font-size:10px">.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:352px; top:14689px; width:147px; height:14px;"><span style="font-family: VDCSKW+SimHei; font-size:10px">Table </span><span style="font-family: SKUFWH+LMRoman10-Bold; font-size:14px">8. </span><span style="font-family: ASLVIP+SimSun; font-size:10px">Attention decay for common time spans
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:385px; top:14707px; width:80px; height:97px;"><span style="font-family: ASLVIP+SimSun; font-size:10px">Time span Decay
<br></span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">1 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">month </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">1.69
<br>3 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">months </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">6.14
<br>6 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">months </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">16.35
<br>1 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">year </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">61.5
<br>18 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">months </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">185.8
<br>2 </span><span style="font-family: ASLVIP+SimSun; font-size:10px">years </span><span style="font-family: BKPETA+LMRoman10-Regular; font-size:15px">528.4
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:79px; top:15073px; width:208px; height:12px;"><span style="font-family: JWCRSX+LMRoman6-Regular; font-size:8px">28</span><span style="font-family: ASLVIP+SimSun; font-size:9px">About </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">8 </span><span style="font-family: ASLVIP+SimSun; font-size:9px">hours on a </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">2.2GHz CPU</span><span style="font-family: ASLVIP+SimSun; font-size:9px">, without </span><span style="font-family: TNICRF+LMRoman9-Regular; font-size:12px">GPU </span><span style="font-family: ASLVIP+SimSun; font-size:9px">acceleration
<br></span></div><span style="position:absolute; border: black 1px solid; left:94px; top:14571px; width:182px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:94px; top:14598px; width:182px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:94px; top:14626px; width:182px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:94px; top:14653px; width:182px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:94px; top:14671px; width:182px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:94px; top:14689px; width:182px; height:0px;"></span>
<div style="position:absolute; border: figure 1px solid; writing-mode:False; left:326px; top:14434px; width:198px; height:158px;"></div><span style="position:absolute; border: black 1px solid; left:379px; top:14705px; width:92px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:379px; top:14723px; width:92px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:379px; top:14809px; width:92px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:70px; top:15074px; width:188px; height:0px;"></span>
</body></html>
